Cluster Monitoring and Health Checking


With Bright Cluster Manager, a comprehensive set of hardware, software and job metrics can be monitored, visualized and analyzed in a variety of ways. Virtually all software and hardware metrics available to the Linux kernel, and all hardware metrics exposed by hardware management interfaces such as IPMI, can be monitored.

Available Metrics


The metrics available by default on a cluster fall into three main categories:

  • Cluster Metrics — These are metrics for the cluster as a whole, often summed or averaged over all regular nodes.
  • Device Metrics — These are metrics for one individual node, such as a compute node, provisioning node, login node, or another type of node or device.
  • Job Metrics — These are metrics for one individual batch or Kubernetes job. Job metrics are accurate even when multiple jobs are running on the same node because they are sampled from the job's cgroup.

For each of the above categories, the following subcategories are available:

  • CPU — Examples of metrics: speed, idle time, user time, system time, wait time.
  • Disk — Examples of metrics: free space, used space, I/O performance, SMART data.
  • Docker — The results of the Docker health checks.
  • Etcd — The results of the Etcd health checks.
  • GPU — Examples of metrics: power consumption, temperature, utilization.
  • Job — The number of running jobs.
  • Kubernetes — The results of the Kubernetes health checks.
  • Memory — Examples of metrics: free memory, used memory, free swap space, used swap space, buffer memory, cache memory.
  • Network — Examples of metrics: bytes sent/received, IP/TCP/UDP errors.
  • Operating System — Examples of metrics: forks, load average, process count, running processes, uptime.
  • Process — CMDaemon metrics: memory used, system time, threads used, virtual memory used.
  • Workload — Examples of metrics: running jobs, queued jobs, failed jobs, completed jobs, estimated delay, average job duration, average expansion factor.


Workload Accounting and Reporting

Bright Cluster Manager’s Workload Accounting and Reporting (WAR) capability combines device and job metrics to give administrators, managers, and users the information they need to use HPC system resources effectively, to maximize system productivity, to enable effective resource sharing, to identify waste and to provide chargeback capability.



Custom Metrics

In addition to the default metrics, you can easily add custom metrics for monitoring by using a custom data producer script. This is a very simple script that captures a value and presents it in a consistent format to Bright Cluster Manager. Examples of custom metrics include values that can be read from an application or from a device such as a UPS, storage unit, firewall device, tape robot, SAN switch or KVM switch. Other interesting examples include metrics from scientific instruments connected to the cluster, such as a microscope, a telescope or a genome sequencer.
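As an illustration, a custom data producer script could look like the sketch below. It reports the percentage of used space on one filesystem. The output format (a single numeric value printed to standard output) and the names `used_percent` and `MONITORED_PATH` are assumptions made for this example; the exact format Bright Cluster Manager expects is not described here and should be checked in the product documentation.

```python
#!/usr/bin/env python3
"""Hypothetical custom metric script: used space on a filesystem, in percent."""
import shutil
import sys

# Assumption for this sketch: the filesystem to report on is fixed in the script.
MONITORED_PATH = "/"


def used_percent(path: str) -> float:
    """Return the percentage of used space on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def main() -> None:
    try:
        value = used_percent(MONITORED_PATH)
    except OSError as exc:
        # Report the problem on stderr and signal failure with a non-zero exit code.
        print(f"cannot read {MONITORED_PATH}: {exc}", file=sys.stderr)
        sys.exit(1)
    # Assumed output format: one numeric value on standard output.
    print(f"{value:.2f}")


if __name__ == "__main__":
    main()
```

The same pattern applies to other sources: replace `used_percent` with a function that reads the value from the application or device of interest.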


Visualization with Graphs

All available metrics can be visualized using graphs. In the monitoring visualization window, multiple graphs can be shown simultaneously. A new graph is created by simply dragging a metric from the metrics tree into an empty graph area. Metrics can also be dragged into existing graph areas to allow for visual comparison between multiple metrics.

You can easily zoom in and out of graphs by dragging your mouse over an area of the graph. The monitoring system will then retrieve the required data automatically to rebuild the graph at a smaller or larger scale. Many features of the graphs can be customized. For example, graph line color and style, graph filling color and style, and graph transparency can all be configured.

All configurations of the monitoring visualization window can be saved for future use. So if you have built up an 8 x 6 matrix of 48 different graphs — each with its own customized color scheme — you can save this configuration and load it quickly later.


Visualization with the Rack View

All available metrics can also be visualized in the Rack View. The Rack View shows the rack layout of the cluster, optionally with one or two metrics displayed per node using a color scale.

If the order and size of the nodes, switches and other devices in the cluster are known to Bright Cluster Manager, they will be used to build the rack layout in the Rack View. Otherwise, the nodes and switches will be shown at equal size and in alphabetical order.

For clusters with many racks, the "zoom out" feature allows you to see the metric values in many racks simultaneously as a color map.

The Rack View is a very useful tool for visualizing what is going on in your cluster. For example, if you show CPU or system temperatures in the Rack View, you can immediately see if some parts of your cluster are running hotter than other parts. You can also use the Rack View to show two metrics simultaneously to see if they are correlated. For example, fan speeds and CPU temperatures will often show some level of correlation.
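The correlation that the Rack View lets you see visually can also be expressed as a number. The sketch below computes the Pearson correlation coefficient between two metrics sampled on the same set of nodes. It is only an illustration of the idea, not part of Bright Cluster Manager, and the sample values are made up.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up values, one entry per node: CPU temperature (Celsius) and fan speed (RPM).
cpu_temp_c = [52.0, 55.5, 61.0, 66.5, 71.0]
fan_rpm = [3100, 3300, 3900, 4400, 4800]

r = pearson(cpu_temp_c, fan_rpm)
print(f"correlation between CPU temperature and fan speed: {r:.3f}")
```

A value close to 1 means the two metrics rise and fall together across nodes, a value close to -1 means one rises as the other falls, and a value near 0 means there is no linear relationship.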


Configuration of the Monitoring System

The Bright monitoring system is fully configurable to match your needs and preferences. Some examples of configurable settings include:

  • Which default and custom metrics to monitor. For example, you can stop certain metrics from being sampled, but you can also just stop metrics from being stored. The latter means that you are saving on storage while you are still able to visualize 'current' values. You can also still define thresholds on metrics you are not storing.
  • How often to sample each metric. For example, you may want to sample CPU temperature values every minute, but fan speed values only every 10 minutes.
  • How long to keep metrics data. For example, you may not be interested in disk performance metrics older than 3 months, but you may be interested in cluster load values over the lifetime of the cluster.
  • How to consolidate each metric over time. For example, you may wish to keep used swap space values of nodes in the node category "large memory nodes" over the lifetime of the system, whereby values of the last 30 days should not be consolidated, but values older than 30 days may be averaged per hour, and values older than 90 days may be averaged per day.
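The consolidation policy in the last example can be sketched in a few lines. The code below keeps samples from the last 30 days unchanged, averages samples between 30 and 90 days old per hour, and averages older samples per day. It illustrates the policy only and is not how Bright Cluster Manager implements consolidation; the function name `consolidate` and its parameters are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def consolidate(samples, now, raw_days=30, hourly_days=90):
    """Consolidate (timestamp, value) samples by age.

    Samples newer than raw_days are kept as they are, samples between raw_days
    and hourly_days old are averaged per hour, and older samples are averaged
    per day. Returns a list of (timestamp, value) tuples sorted by timestamp.
    """
    raw_cutoff = now - timedelta(days=raw_days)
    hourly_cutoff = now - timedelta(days=hourly_days)
    result = []
    hourly = defaultdict(list)  # start of hour -> values in that hour
    daily = defaultdict(list)   # start of day  -> values on that day
    for ts, value in samples:
        if ts >= raw_cutoff:
            result.append((ts, value))
        elif ts >= hourly_cutoff:
            hourly[ts.replace(minute=0, second=0, microsecond=0)].append(value)
        else:
            daily[ts.replace(hour=0, minute=0, second=0, microsecond=0)].append(value)
    for buckets in (hourly, daily):
        for start, values in buckets.items():
            result.append((start, sum(values) / len(values)))
    return sorted(result)


# Made-up used swap space samples (in MiB) of different ages.
utc = timezone.utc
now = datetime(2024, 6, 1, tzinfo=utc)
samples = [
    (datetime(2024, 5, 31, 12, 0, tzinfo=utc), 100.0),  # recent: kept as is
    (datetime(2024, 5, 31, 12, 10, tzinfo=utc), 110.0),  # recent: kept as is
    (datetime(2024, 4, 17, 10, 5, tzinfo=utc), 2.0),     # 30 to 90 days: hourly average
    (datetime(2024, 4, 17, 10, 35, tzinfo=utc), 4.0),
    (datetime(2024, 2, 1, 3, 0, tzinfo=utc), 10.0),      # older than 90 days: daily average
    (datetime(2024, 2, 1, 15, 0, tzinfo=utc), 20.0),
]

for ts, value in consolidate(samples, now):
    print(ts.isoformat(), value)
```

Averaging is used here because it is the method named in the example above; other consolidation functions, such as minimum or maximum, would follow the same bucketing pattern.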


Monitoring Architecture

All monitoring data is either sampled locally by the cluster management daemon (CMDaemon) on each regular and head node, or it is sampled directly from the BMC through the IPMI or iLO interface. In both cases, sampling is optimized for minimal resource consumption. For example, the CMDaemon samples all metrics in one process, without forking additional processes, whereas sampling through the IPMI or iLO interface happens out-of-band.

The CMDaemon on the head node periodically collects the data from the CMDaemons on the other nodes and stores it as raw data in the raw database hosted on the head node. The data is subsequently consolidated into the consolidated database, which is also hosted on the head node.

When the cluster management GUI generates a graph or a Rack View, it requests the required data from the CMDaemon on the head node, which reads it from the consolidated database.