By Martijn de Vries | February 22, 2016
Most HPC users conduct device-centered monitoring to keep track of the health of their cluster nodes. They select a device and a metric of interest, then view the results over time in a chart or graph: for example, the CPU temperature of node 5 of a cluster.
That is useful for administrators, but users care more about the health and resource usage of their specific job than about the devices it runs on. That’s why we’ve added job-based metrics to our Bright Cluster Manager solutions. Users can now select a job that is currently running (or has recently run) and get metrics for both nodes and jobs. For example, they can plot memory consumption for a particular job. These screenshots illustrate the effectiveness of visualizing monitoring data on a job basis rather than a device basis:
This table lists some of the most useful job metrics you can monitor:
| Metric | Description |
| --- | --- |
| Block device usage metrics (storage usage, or I/O) for each device installed on the node | |
| blkio.time | Time that the job had I/O access to a specific device |
| blkio.sectors | Number of sectors transferred to or from specific devices by a cgroup |
| blkio.io_serviced_read, blkio.io_serviced_write, blkio.io_serviced_sync, blkio.io_serviced_async | Number of I/O operations (for each operation type) performed on specific devices |
| blkio.io_service_read, blkio.io_service_write, blkio.io_service_sync, blkio.io_service_async | Number of bytes transferred to or from specific devices (for each operation type) |
| CPU usage metrics | |
| cpuacct.usage | Total CPU time consumed by all processes of the job |
| cpuacct.stat.user | User CPU time consumed by all processes of the job |
| cpuacct.stat.system | System CPU time consumed by all processes of the job |
| Memory usage metrics | |
| memory.usage | Total current memory usage |
| memory.memsw.usage | Sum of current memory plus swap space usage |
| memory.memsw.max_usage | Maximum amount of memory and swap space used |
| memory.failcnt | Number of times that memory usage has hit the limit set in memory.limit_in_bytes |
| memory.memsw.failcnt | Number of times that memory plus swap space usage has hit the limit set in memory.memsw.limit_in_bytes |
| memory.stat.swap | Total swap usage |
| memory.stat.cache | Total page cache, including tmpfs (shmem) |
| memory.stat.mapped_file | Size of memory-mapped files, including tmpfs (shmem) |
| memory.stat.unevictable | Memory that cannot be reclaimed |
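
The metric names in the table closely mirror the accounting files exposed by the Linux cgroup v1 controllers (blkio, cpuacct, memory). This post does not spell out that mapping, so treat it as an assumption. Under that assumption, here is a minimal Python sketch of how values such as cpuacct.stat.user or memory.stat.cache could be read from the "key value" text those files contain. The helper names `parse_stat` and `ticks_to_seconds` and the sample contents are hypothetical, and this is not necessarily how Bright collects the data.

```python
from typing import Dict


def parse_stat(text: str) -> Dict[str, int]:
    """Parse 'key value' lines, as found in cgroup v1 files such as
    cpuacct.stat and memory.stat, into a dict of integers."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        key, value = parts
        stats[key] = int(value)
    return stats


def ticks_to_seconds(ticks: int, user_hz: int = 100) -> float:
    """cpuacct.stat reports CPU time in USER_HZ ticks. The default of 100
    is an assumption; os.sysconf('SC_CLK_TCK') gives the real value."""
    return ticks / user_hz


# Made-up sample contents for illustration, not measurements from a real job.
SAMPLE_CPUACCT_STAT = "user 4200\nsystem 350\n"
SAMPLE_MEMORY_STAT = "cache 1048576\nmapped_file 262144\nswap 0\nunevictable 0\n"

cpu = parse_stat(SAMPLE_CPUACCT_STAT)
mem = parse_stat(SAMPLE_MEMORY_STAT)

print("user CPU seconds:", ticks_to_seconds(cpu["user"]))
print("system CPU seconds:", ticks_to_seconds(cpu["system"]))
print("page cache bytes:", mem["cache"])
```

On a real node the text would come from the files in the job's cgroup directory rather than from the sample strings. The location of that directory depends on the workload manager and is not covered here.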
Of course, we continue to collect all of the node-based metrics you have come to expect from Bright. CPU idle time, CPU temperature, free memory available on a node, the time since the node was last booted, and more are all tracked and available for you to see. By adding job-based metrics, we’re adding a new dimension to the way you monitor your clusters. We encourage you to try them out and let us know what you think.