Job-based metrics help users visualize information on what is most important to them


By Martijn de Vries | February 22, 2016



Most HPC users conduct device-centered monitoring to keep track of the health of their cluster nodes. They select a device and a metric they are interested in, and then see the results over time in a chart or graph: for example, the CPU temperature of node 5 in a cluster.

That is useful for administrators, but users care more about the health and resource usage of their specific jobs than about the devices those jobs run on. That’s why we’ve added job-based metrics to our Bright Cluster Manager solutions. Users can now select a job that is currently running (or has recently run) and get metrics for both nodes and jobs. For example, they can plot memory consumption for a particular job. These screenshots illustrate the effectiveness of visualizing monitoring data on a job basis rather than a device basis:




This table lists some of the most useful job metrics you can monitor:

Block device usage metrics (storage usage, or I/O) for each device installed on the node


Time that the job had I/O access to a specific device


Number of sectors transferred to or from specific devices by a cgroup

blkio.io_serviced_read, blkio.io_serviced_write, blkio.io_serviced_sync, blkio.io_serviced_async

Number of I/O operations (for each operation type) performed on specific devices

blkio.io_service_read, blkio.io_service_write, blkio.io_service_sync, blkio.io_service_async

Number of bytes transferred to or from specific devices (for each operation type)

CPU usage metrics


Total CPU time consumed by all processes of the job


User CPU time consumed by all processes of the job


System CPU time consumed by all processes of the job

Memory usage metrics


Total current memory usage


Sum of current memory plus swap space usage


Maximum amount of memory and swap space used


Number of times that memory usage has reached the limit set in memory.limit_in_bytes


Number of times that memory plus swap space usage has reached the limit set in memory.memsw.limit_in_bytes


Total swap usage


Total page cache, including tmpfs (shmem)


Size of memory-mapped files, including tmpfs (shmem)


Memory that cannot be reclaimed
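
The table refers to cgroups and to limit files such as memory.limit_in_bytes, which suggests these job metrics are derived from Linux cgroup (v1) accounting files. The sketch below shows how files of that kind could be parsed. It is an illustration under that assumption, not a description of how Bright Cluster Manager collects its data. The function names are hypothetical and the sample contents are made up.

```python
from collections import defaultdict


def parse_per_device_ops(text):
    """Parse blkio.io_serviced-style text into {device: {operation: value}}."""
    stats = defaultdict(dict)
    for line in text.splitlines():
        fields = line.split()
        # Per-device lines have three fields ("8:0 Read 120");
        # the final grand-total line ("Total 150") has two and is skipped.
        if len(fields) == 3:
            device, operation, value = fields
            stats[device][operation.lower()] = int(value)
    return dict(stats)


def parse_key_value(text):
    """Parse 'name value' lines, as found in cpuacct.stat and memory.stat."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            values[fields[0]] = int(fields[1])
    return values


def cpu_seconds(cpuacct_stat_text, ticks_per_second=100):
    """Convert cpuacct.stat tick counts (USER_HZ units) to seconds."""
    ticks = parse_key_value(cpuacct_stat_text)
    return {mode: count / ticks_per_second for mode, count in ticks.items()}


# Made-up sample contents, for illustration only.
BLKIO_SAMPLE = (
    "8:0 Read 120\n8:0 Write 30\n8:0 Sync 100\n"
    "8:0 Async 50\n8:0 Total 150\nTotal 150\n"
)
CPU_SAMPLE = "user 250\nsystem 50\n"
MEMORY_SAMPLE = "cache 4096\nmapped_file 1024\nswap 0\nunevictable 512\n"

if __name__ == "__main__":
    print(parse_per_device_ops(BLKIO_SAMPLE))
    print(cpu_seconds(CPU_SAMPLE))
    print(parse_key_value(MEMORY_SAMPLE))
```

On a real node, the text would come from files in the job's cgroup directory, and the tick rate could be obtained with os.sysconf("SC_CLK_TCK") instead of the default of 100 assumed here.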

Of course, we continue to collect all of the node-based metrics you have come to expect from Bright. CPU idle time, CPU temperature, free memory on a node, the time since the node was last booted, and more are all tracked and available for you to see. By adding job-based metrics, we’re adding a new dimension to the way you monitor your clusters. We encourage you to try them out and let us know what you think.
