Cluster monitoring vs. health checking: What’s the difference?


If you are responsible for managing a cluster, you almost certainly use monitoring software to keep it running properly. Many organizations, however, lump cluster monitoring together with cluster health checks, as if the two were interchangeable. They’re not.

One way to look at it: cluster monitoring tracks and measures data, while health checks are diagnostic and tell you how well things are working.
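The distinction can be sketched in a few lines of Python. This is only an illustration, not how any particular cluster manager implements it: the names `Sample`, `record_metric`, and `health_check`, the PASS/FAIL/UNKNOWN labels, and the temperature range are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    timestamp: float  # seconds since some reference point
    value: float      # e.g. a temperature reading in degrees Celsius


def record_metric(history: List[Sample], timestamp: float, value: float) -> None:
    """Monitoring: store the measurement; no judgement is made about it."""
    history.append(Sample(timestamp, value))


def health_check(history: List[Sample], low: float, high: float) -> str:
    """Health check: judge the latest measurement against an allowed range."""
    if not history:
        return "UNKNOWN"  # nothing measured yet, so no verdict is possible
    latest = history[-1].value
    return "PASS" if low <= latest <= high else "FAIL"


# Two hypothetical readings taken two minutes apart.
history: List[Sample] = []
record_metric(history, 0.0, 61.5)
record_metric(history, 120.0, 93.0)

# The allowed range (10 to 85 degrees) is an assumed example value.
status = health_check(history, low=10.0, high=85.0)
print(status)
```

The monitoring side only accumulates data that can later be graphed or analyzed; the health check turns the most recent value into a verdict that something can act on.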

Cluster health management is a powerful tool that should be in every cluster management solution. Just as a doctor takes your temperature, checks your blood pressure, and measures your heart rate, cluster management health checks provide a full system scan, every few minutes, to make sure everything is running within specified parameters.

Today’s enterprise-grade cluster health checks can even take preemptive action when predetermined system thresholds are exceeded, saving you time and preventing hardware damage. The many issues that can affect cluster or cloud performance create a need for ongoing health checks. Your ROI demands it.

Health Checking Your System

Hardware and software tend to degrade over time. Environmental factors such as temperature and humidity can take a toll on equipment, including fans and even the disks themselves. Software can become corrupted or outdated, creating problems that are hard to diagnose.

Health checks provide ongoing analysis of data from running clusters, proactively measuring critical information on each node. For example, you can set a temperature threshold for GPUs so that the system automatically shuts down an overheated unit and sends a message to the system administrator. If a problem arises or a suboptimal condition is detected, such as a node running very hot, the system alerts the administrator in real time, and the administrator can then take remedial action without downtime.
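As a rough sketch of the GPU temperature example, the Python below evaluates a set of readings against two thresholds: a warning level that only sends an alert, and a shutdown level that stops the unit and notifies the administrator. Everything here is hypothetical: the `GpuTempPolicy` and `evaluate_gpus` names, the threshold values, the GPU identifiers, and the `shutdown` and `notify` callbacks stand in for whatever a real cluster manager provides.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class GpuTempPolicy:
    warn_c: float = 80.0      # assumed warning threshold, degrees Celsius
    shutdown_c: float = 90.0  # assumed shutdown threshold, degrees Celsius


def evaluate_gpus(
    temps_c: Dict[str, float],
    policy: GpuTempPolicy,
    shutdown: Callable[[str], None],
    notify: Callable[[str], None],
) -> Dict[str, str]:
    """Check each GPU reading and take the action the policy calls for."""
    results: Dict[str, str] = {}
    for gpu, temp in sorted(temps_c.items()):
        if temp >= policy.shutdown_c:
            shutdown(gpu)  # preemptive action to protect the hardware
            notify(f"{gpu} shut down at {temp:.1f} C")
            results[gpu] = "FAIL"
        elif temp >= policy.warn_c:
            notify(f"{gpu} running hot at {temp:.1f} C")
            results[gpu] = "WARN"
        else:
            results[gpu] = "PASS"
    return results


# Stand-ins for real actions: collect what would be stopped or sent.
stopped: List[str] = []
messages: List[str] = []

readings = {"node001:gpu0": 65.0, "node001:gpu1": 84.0, "node002:gpu0": 95.5}
results = evaluate_gpus(readings, GpuTempPolicy(), stopped.append, messages.append)
print(results)
print(messages)
```

Passing the actions in as callbacks keeps the decision logic separate from how a unit is actually shut down or how an alert is delivered, which also makes the check easy to exercise without touching real hardware.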

Bottom Line

Today’s cluster and cloud management software provides comprehensive solutions for provisioning and managing HPC clusters, Hadoop clusters, and OpenStack clouds. Metrics can be monitored, visualized, and analyzed in myriad ways. Cluster health checks, on the other hand, can alert you to, prevent, and even predict many cluster- and cloud-related degradation and failure scenarios, bringing pinpoint clarity to cluster health issues, from hardware acceptance tests to ongoing operational use.

Bright Cluster Manager® comes with hundreds of built-in metrics and health checks. The framework is extensible, so you can add your own business-specific metrics and health checks, and this functionality extends to both Hadoop and OpenStack.


Next up: Black Hole Node Syndrome
