Cluster Health Management
Over a cluster's lifetime, many software and hardware health issues can arise, ranging from teething problems to wear and tear.
Bright Cluster Manager® includes Cluster Health Management — a set of powerful functions that help to keep your cluster healthy.
The functionality was developed over many years and is based on our extensive experience in deploying, managing and using large and complex clusters, including many TOP500 clusters.
Why You Need Cluster Health Management
An HPC cluster is not a static system — it evolves over time as hardware and software are subject to change, especially when used intensively:
- Hardware is sensitive to environmental conditions, such as temperature and humidity. When your supplier delivers your new cluster to your site, its components have been exposed to many different environments as they moved through the supply chain. This causes stress and exposes faults.
- A cluster contains many hardware parts that are subject to wear and tear, in particular moving parts such as fans and disks.
- Filesystems can get corrupted.
- User applications can spawn processes that are not killed after the job has finished.
- Software packages can go out of date and become incompatible with newer versions of other packages.
- Users and system administrators may leave old software in places where it affects the system.
Cluster Management Automation
Cluster management automation is a powerful tool in cluster health management: it takes preemptive action when predetermined system thresholds are exceeded, saving time and preventing hardware damage. Thresholds can be configured on any of the available metrics. For example, a temperature threshold can be set for GPUs so that the system automatically shuts down an overheated GPU unit and sends an SMS message to the system administrator's mobile phone. Cluster management automation is described in detail here.
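The GPU example can be sketched roughly as a threshold rule that triggers actions. The sketch below is illustrative only and does not reflect Bright Cluster Manager's actual configuration interface; read_gpu_temperature, shut_down_gpu_unit, notify_administrator and the 85 °C threshold are hypothetical placeholders.

    GPU_TEMP_THRESHOLD_C = 85.0   # hypothetical threshold, for illustration only

    def read_gpu_temperature(node, gpu_index):
        """Placeholder for the monitoring system's GPU temperature metric."""
        return 90.0   # simulated reading so the sketch can be run end to end

    def shut_down_gpu_unit(node, gpu_index):
        """Placeholder for the action that powers down the overheated GPU unit."""
        print(f"shutting down GPU {gpu_index} on {node}")

    def notify_administrator(message):
        """Placeholder for sending an SMS to the administrator's phone."""
        print(f"SMS to administrator: {message}")

    def evaluate_gpu_rule(node, gpu_index):
        """Evaluate one threshold rule and trigger its actions when exceeded."""
        temperature = read_gpu_temperature(node, gpu_index)
        if temperature > GPU_TEMP_THRESHOLD_C:
            shut_down_gpu_unit(node, gpu_index)
            notify_administrator(
                f"GPU {gpu_index} on {node} reached {temperature:.1f} C "
                f"(threshold {GPU_TEMP_THRESHOLD_C:.0f} C) and was shut down."
            )

    evaluate_gpu_rule("node042", 0)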
Health Checks
An essential element in Bright Cluster Health Management is the health check. A health check is an action that returns one of three possible health states: PASS, FAIL or UNKNOWN. A health check has a settable severity associated with a FAIL or UNKNOWN response, as well as a settable message. A health check can also launch an action based on any of the response values, similar to the way an action is launched by a metric with a threshold condition.
Some arbitrary examples of health checks are:
- check if the hard drive still has enough space left on it and return PASS if it has;
- check if an NFS mount is accessible and return FAIL if it is not;
- check if CPUUser is below 50% and return PASS if it is;
- check if the cmsh binary is found and return UNKNOWN if it is not.
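As an illustration of what a custom health check along these lines might look like, the following sketch checks free disk space and prints the resulting state. The convention of reporting PASS, FAIL or UNKNOWN on standard output follows the description above, while the path and the 10% threshold are arbitrary example values.

    #!/usr/bin/env python3
    import shutil
    import sys

    MIN_FREE_FRACTION = 0.10   # arbitrary example: FAIL below 10% free space
    CHECK_PATH = "/"           # filesystem to inspect

    def main():
        try:
            usage = shutil.disk_usage(CHECK_PATH)
        except OSError as exc:
            # The check itself could not run, so the state is neither PASS nor FAIL.
            print(f"UNKNOWN could not inspect {CHECK_PATH}: {exc}")
            return 0
        free_fraction = usage.free / usage.total
        if free_fraction >= MIN_FREE_FRACTION:
            print("PASS")
        else:
            print(f"FAIL only {free_fraction:.0%} of {CHECK_PATH} is free")
        return 0

    if __name__ == "__main__":
        sys.exit(main())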
Prejob Health Checks
The prejob health check is a special type of health check that is run just before a job is executed. It is the most powerful medicine against the so-called Black Hole Node Syndrome, where one or more unhealthy nodes crash every job assigned to them. A prejob health check instructs the workload manager to hold the job briefly to allow the health check to be run on the nodes reserved by the workload manager. If the health check fails on any of the reserved nodes, the predefined action(s) will be taken. A common action sequence to run when a prejob health check fails on a node would be:
- instruct the workload manager to reschedule the job to a different node;
- give the node a status of "Drained", which means no more jobs will be scheduled to it;
- inform the system administrator.
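That sequence can be sketched roughly as follows; run_health_checks, drain_node, reschedule_job and notify_administrator are hypothetical placeholders for the mechanisms that the workload manager and cluster manager actually provide.

    def run_health_checks(node):
        """Placeholder: True when every prejob health check on the node returns PASS."""
        return node != "node013"   # simulated unhealthy node for illustration

    def drain_node(node):
        print(f"setting {node} to Drained: no further jobs will be scheduled to it")

    def reschedule_job(job_id):
        print(f"asking the workload manager to reschedule job {job_id}")

    def notify_administrator(message):
        print(f"notify administrator: {message}")

    def prejob_gate(job_id, reserved_nodes):
        """Hold the job, check every reserved node, and act on any failures."""
        unhealthy = [n for n in reserved_nodes if not run_health_checks(n)]
        if not unhealthy:
            return True   # all nodes healthy: release the job as planned
        for node in unhealthy:
            drain_node(node)
        reschedule_job(job_id)
        notify_administrator(f"job {job_id} rescheduled; unhealthy nodes: {unhealthy}")
        return False

    prejob_gate("12345", ["node012", "node013", "node014"])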
Burn-in Tests
The burn-in test puts one or more nodes through a set of predefined stress and performance tests aimed at exposing hardware faults. It also measures performance metrics that can indicate latent or imminent hardware faults. The burn-in test can be run at different levels of intensity. The most intense burn-in test is accessed by rebooting a node into burn-in mode; this forces multiple reboots and wipes all data off the hard disk. Less intense burn-in tests that do not require reboots or wiping of hard disks can be scheduled to run through the workload manager at set intervals.
Examples of stress and performance tests that can be part of a burn-in test:
- Measure hard disk write speed with CPUs idle.
- Measure hard disk write speed while stressing CPUs.
- Check for bad blocks while running memory test.
- Test power supply with hard power resets.
- Check for bad blocks while running Linpack on all CPUs.
- Compile Linux kernel.
- Run mprime torture test.
- Run memtest86.
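One such test, measuring hard disk write speed while the CPUs are kept busy, might be sketched along these lines. The file path, file size and worker count are arbitrary example values, and a real burn-in framework would also validate the written data and record the result as a metric.

    import multiprocessing
    import os
    import time

    WRITE_BYTES = 256 * 1024 * 1024            # 256 MiB test file
    TEST_FILE = "/tmp/burnin_write_test.bin"   # arbitrary scratch location

    def busy_loop(stop_flag):
        """Keep one CPU core busy until told to stop."""
        while not stop_flag.value:
            sum(i * i for i in range(10_000))

    def timed_write():
        """Write a prepared buffer to disk and return the speed in MiB/s."""
        data = os.urandom(WRITE_BYTES)         # prepare data outside the timed region
        start = time.monotonic()
        with open(TEST_FILE, "wb") as fh:
            fh.write(data)
            fh.flush()
            os.fsync(fh.fileno())
        elapsed = time.monotonic() - start
        os.remove(TEST_FILE)
        return (WRITE_BYTES / (1024 * 1024)) / elapsed

    if __name__ == "__main__":
        stop_flag = multiprocessing.Value("b", False)
        workers = [multiprocessing.Process(target=busy_loop, args=(stop_flag,))
                   for _ in range(os.cpu_count() or 1)]
        for worker in workers:
            worker.start()
        try:
            print(f"write speed under CPU load: {timed_write():.1f} MiB/s")
        finally:
            stop_flag.value = True
            for worker in workers:
                worker.join()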
Video: Cluster Health
“The centralized status and health information database simplifies troubleshooting and reduces service disturbances.”
— Erik Engquist, Systems Administrator at the University of Houston