Job-based metrics help users visualize what matters most to them


By Martijn de Vries | February 22, 2016

Most HPC users conduct device-centered monitoring to keep track of the health of their cluster nodes. They select a device and a metric of interest, and then view the results over time as a chart or graph, for example the CPU temperature of node 5 of a cluster.

That is useful for administrators, but users care more about the health and resource usage of their own jobs than about the devices those jobs run on. That’s why we’ve added job-based metrics to our Bright Cluster Manager solutions. Users can now select a job that is currently running (or has recently run) and get metrics for both nodes and jobs. For example, they can plot memory consumption for a particular job. These screenshots illustrate the effectiveness of visualizing monitoring data on a job basis rather than a device basis:

s1-jobmetrics.png

s2-jobmetrics.png

s3-jobmetrics.png
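To make the difference between the two views concrete, here is a minimal Python sketch. The `Sample` record, the job and node names, and the values are all hypothetical and are not Bright Cluster Manager data structures or APIs. The sketch only shows how the same samples can be grouped by node (the device-centered view) or by job (the job-based view).

```python
from collections import defaultdict
from typing import NamedTuple


class Sample(NamedTuple):
    """One hypothetical monitoring sample (not a Bright data structure)."""
    job_id: str      # job the sample was taken for
    node: str        # node the job's processes ran on
    timestamp: int   # seconds since the start of monitoring
    value: float     # for example, memory usage in bytes


def series_for_job(samples, job_id):
    """Job-based view: sum the metric across all nodes of one job."""
    totals = defaultdict(float)
    for s in samples:
        if s.job_id == job_id:
            totals[s.timestamp] += s.value
    return sorted(totals.items())


def series_for_node(samples, node):
    """Device-centered view: sum the metric on one node, whatever the job."""
    totals = defaultdict(float)
    for s in samples:
        if s.node == node:
            totals[s.timestamp] += s.value
    return sorted(totals.items())


# Made-up samples: job-101 spans two nodes, job-202 shares node001.
samples = [
    Sample("job-101", "node001", 0, 2.0e9),
    Sample("job-101", "node002", 0, 1.5e9),
    Sample("job-202", "node001", 0, 4.0e9),
    Sample("job-101", "node001", 60, 2.5e9),
    Sample("job-101", "node002", 60, 1.0e9),
]

job_series = series_for_job(samples, "job-101")
node_series = series_for_node(samples, "node001")
```

In this sketch the node view mixes the memory of both jobs on node001, while the job view follows job-101 across the two nodes it runs on, which is the number a user usually wants to plot.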

This table lists some of the most useful job metrics you can monitor:

Block device usage metrics (storage usage, or I/O) for each device installed on the node

| Metric | Description |
| --- | --- |
| blkio.time | Time that the job had I/O access to a specific device |
| blkio.sectors | Number of sectors transferred to or from specific devices by a cgroup |
| blkio.io_serviced_read, blkio.io_serviced_write, blkio.io_serviced_sync, blkio.io_serviced_async | Number of I/O operations (for each operation type) performed on specific devices |
| blkio.io_service_read, blkio.io_service_write, blkio.io_service_sync, blkio.io_service_async | Number of bytes transferred to or from specific devices (for each operation type) |

CPU usage metrics

| Metric | Description |
| --- | --- |
| cpuacct.usage | Total CPU time consumed by all processes of the job |
| cpuacct.stat.user | User CPU time consumed by all processes of the job |
| cpuacct.stat.system | System CPU time consumed by all processes of the job |

Memory usage metrics

| Metric | Description |
| --- | --- |
| memory.usage | Total current memory usage |
| memory.memsw.usage | Sum of current memory plus swap space usage |
| memory.memsw.max_usage | Maximum amount of memory and swap space used |
| memory.failcnt | Number of times that memory usage has reached the limit set in memory.limit_in_bytes |
| memory.memsw.failcnt | Number of times that memory plus swap space usage has reached the limit set in memory.memsw.limit_in_bytes |
| memory.stat.swap | Total swap usage |
| memory.stat.cache | Total page cache, including tmpfs (shmem) |
| memory.stat.mapped_file | Size of memory-mapped files, including tmpfs (shmem) |
| memory.stat.unevictable | Memory that cannot be reclaimed |
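The metric names listed above resemble the accounting files that the Linux cgroup v1 controllers (blkio, cpuacct, memory) expose for each cgroup. Assuming that is where such values originate, the following Python sketch shows how text in those file layouts could be parsed. The sample contents are made up, the function names are hypothetical, and nothing here is Bright's own collection code.

```python
def parse_flat_keyed(text):
    """Parse 'key value' lines, the layout of cpuacct.stat and memory.stat."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            result[parts[0]] = int(parts[1])
    return result


def parse_blkio(text):
    """Parse 'major:minor Operation value' lines, as in blkio.io_serviced."""
    per_device = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip the trailing grand-total line ("Total N")
        device, operation, value = parts
        per_device.setdefault(device, {})[operation.lower()] = int(value)
    return per_device


def ticks_to_seconds(ticks, ticks_per_second=100):
    """Convert cpuacct.stat ticks to seconds.

    The default of 100 ticks per second is an assumption; it is a common
    value but should be checked on the actual system.
    """
    return ticks / ticks_per_second


# Sample file contents, made up for illustration.
CPUACCT_STAT = "user 4200\nsystem 300\n"
MEMORY_STAT = "cache 1048576\nmapped_file 262144\nswap 0\nunevictable 0\n"
BLKIO_IO_SERVICED = (
    "8:0 Read 120\n8:0 Write 80\n8:0 Sync 150\n8:0 Async 50\n"
    "8:0 Total 200\nTotal 200\n"
)

cpu = parse_flat_keyed(CPUACCT_STAT)
mem = parse_flat_keyed(MEMORY_STAT)
io = parse_blkio(BLKIO_IO_SERVICED)

user_seconds = ticks_to_seconds(cpu["user"])
```

On a live node the text would come from the job's cgroup directory rather than from string constants, and the tick rate could be queried with `os.sysconf("SC_CLK_TCK")` instead of being assumed. Where a workload manager places a job's cgroup depends on its configuration, so the sketch leaves file paths out.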


Of course, we continue to collect all of the node-based metrics you have come to expect from Bright. CPU idle time, CPU temperature, free memory on a node, the time since the node was last booted, and more are all tracked and available for you to see. By adding job-based metrics, we’re adding a new dimension to the way you monitor your clusters. We encourage you to try them out and let us know what you think.
