With Bright Cluster Manager, a comprehensive set of hardware, software and job metrics can be monitored, visualized and analyzed in a variety of ways. Virtually all software and hardware metrics available to the Linux kernel, and all hardware metrics available through hardware management interfaces such as IPMI, are available.
The metrics available by default on a cluster fall into two main categories:
For each of the above categories, the following subcategories are available:
Bright Cluster Manager’s Workload Accounting and Reporting (WAR) capability combines device and job metrics to give administrators, managers, and users the information they need to use HPC system resources effectively, to maximize system productivity, to enable effective resource sharing, to identify waste and to provide chargeback capability.
In addition to the default metrics, you can easily add custom metrics for monitoring by using a custom data producer script. This is a very simple script that captures a value and presents it in a consistent format to Bright Cluster Manager. Examples of custom metrics include values that can be read from an application or from a device such as a UPS, storage unit, firewall device, tape robot, SAN switch or KVM switch. Other interesting examples include metrics from scientific instruments connected to the cluster, such as a microscope, a telescope or a genome sequencer.
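As an illustration, a custom data producer can be as simple as a script that prints a single numeric value to standard output for Bright Cluster Manager to sample. The sketch below is a hypothetical Python example for a UPS load metric; the vendor query command and the exact output convention are assumptions, so it should be adapted to the conventions documented for your Bright version and your device.

#!/usr/bin/env python3
"""Hypothetical custom data producer: report UPS output load as a percentage.

Assumption: the metric value is reported by printing a single number to
standard output, which the monitoring system then samples. Check the Bright
documentation for the exact output format expected by your version.
"""
import subprocess
import sys

# Hypothetical vendor CLI that reports UPS status; replace with whatever
# interface (SNMP, REST, serial) your UPS actually provides.
UPS_QUERY_CMD = ["upsstatus", "--output-load-percent"]

def read_ups_load() -> float:
    """Return the UPS output load as a percentage (0-100)."""
    result = subprocess.run(UPS_QUERY_CMD, capture_output=True, text=True, check=True)
    return float(result.stdout.strip())

if __name__ == "__main__":
    try:
        print(read_ups_load())
    except Exception as exc:
        # A non-zero exit code lets the monitoring system flag the sample as failed.
        print(f"error reading UPS load: {exc}", file=sys.stderr)
        sys.exit(1)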
All available metrics can be visualized using graphs. In the monitoring visualization window, multiple graphs can be shown simultaneously. A new graph is created by simply dragging a metric from the metrics tree into an empty graph area. Metrics can also be dragged into existing graph areas to allow for visual comparison between multiple metrics.
You can easily zoom in and out of graphs by dragging your mouse over an area of the graph. The monitoring system will then retrieve the required data automatically to rebuild the graph at a smaller or larger scale. Many features of the graphs can be customized. For example, graph line color and style, graph filling color and style, and graph transparency can all be configured.
All configurations of the monitoring visualization window can be saved for future use. So if you have built up an 8 x 6 matrix of 48 different graphs — each with its own customized color scheme — you can save this configuration and load it quickly later.
All available metrics can also be visualized in the Rack View. The Rack View shows the rack layout of the cluster, with optionally one or two metrics displayed per node using a color scale.
If the order and size of the nodes, switches and other devices in the cluster are known to Bright Cluster Manager, this information is used to build the rack layout in the Rack View. Otherwise, the nodes and switches are shown at equal size and in alphabetical order.
For clusters with many racks, the "zoom out" feature allows you to see the metric values in many racks simultaneously as a color map.
The Rack View is a very useful tool for visualizing what is going on in your cluster. For example, if you show CPU or system temperatures in the Rack View, you can immediately see if some parts of your cluster are running hotter than other parts. You can also use the Rack View to show two metrics simultaneously to see if they are correlated. For example, fan speeds and CPU temperatures will often show some level of correlation.
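The correlation mentioned above can also be checked numerically. The sketch below is a minimal, hypothetical example that computes the Pearson correlation between two series of samples, such as fan speeds and CPU temperatures exported for the same node and time window; the sample values shown are made up for illustration.

"""Minimal sketch: Pearson correlation between two metric sample series.

The fan-speed and temperature values below are made-up illustrations; in
practice they would be exported from the monitoring system for the same
time window and the same node.
"""
from statistics import correlation  # available in Python 3.10+

fan_rpm = [3200, 3400, 3900, 4500, 5200, 5600]     # hypothetical fan speed samples
cpu_temp_c = [48.0, 50.5, 55.0, 61.5, 68.0, 71.0]  # hypothetical CPU temperature samples

r = correlation(fan_rpm, cpu_temp_c)
print(f"Pearson correlation: {r:.2f}")  # a value near 1.0 indicates strong correlation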
The Bright monitoring system is fully configurable to match your needs and preferences. Some examples of configurable settings include:
All monitoring data is either sampled locally by the cluster management daemon (CMDaemon) on each regular and head node, or it is sampled directly from the BMC through the IPMI or iLO interface. In both cases, sampling is optimized for minimal resource consumption. For example, the CMDaemon samples all metrics in one process, without forking additional processes, whereas sampling through the IPMI or iLO interface happens out-of-band.
The CMDaemon on the head node periodically collects the data from the CMDaemons on the other nodes and stores it as raw data in the raw database hosted on the head node. The data is subsequently consolidated into the consolidated database, which is also hosted on the head node.
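Conceptually, consolidation reduces many raw samples to summary values over longer intervals. The sketch below is a hypothetical illustration of that idea, averaging raw samples into fixed-width time buckets; it does not reflect CMDaemon's actual implementation or database schema.

"""Hypothetical illustration of consolidation: averaging raw samples into
fixed-width time intervals. Not a reflection of CMDaemon's actual
implementation or storage schema."""
from collections import defaultdict
from statistics import mean

# Raw samples as (unix_timestamp, value) pairs; the values are made up.
raw_samples = [(0, 41.0), (60, 42.5), (120, 44.0), (180, 43.5), (240, 47.0), (300, 48.5)]

INTERVAL = 180  # consolidate into 3-minute buckets (an arbitrary choice)

buckets = defaultdict(list)
for ts, value in raw_samples:
    buckets[ts // INTERVAL * INTERVAL].append(value)

consolidated = {start: mean(values) for start, values in sorted(buckets.items())}
print(consolidated)  # e.g. {0: 42.5, 180: 46.33...}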
When the cluster management GUI generates a graph or a Rack View, it requests the required data from the CMDaemon on the head node, which reads it from the consolidated database.