Some cluster health issues are easier to detect than others. They range from defective devices caught by delivery checks on new equipment to the hard-to-detect Black Hole Node Syndrome (BHNS). In a recent survey, more than 64% of respondents stated that BHNS has affected their systems and their work.
Black Hole Node Syndrome: Common Causes
In general, node failure in HPC systems is a somewhat common occurrence that normally would not create significant issues or downtime. Dead nodes are easy to detect. Once noticed, administrators simply work around the downed units, omitting them when scheduling jobs.
A bigger problem, and a much harder one to spot, is Black Hole Node Syndrome, where nodes seem to be working according to spec but are unhealthy in less noticeable, often random, ways. Unhealthy nodes can act as “black holes”, sucking all jobs out of a workload manager queue and leaving users and system administrators wondering where all the jobs went. This typically happens because a job sent to such a node fails almost immediately, so the node looks idle again and the scheduler keeps sending it more work. Causes of BHNS can include a GPU driver that failed to load; an irregular system clock; an unmounted parallel file system; a malfunctioning InfiniBand adapter; errors on the disk drive; system services not running; or a host of other issues. If undetected, these bad nodes can cause recurrent or cascading failures.
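As a rough illustration of what checks for two of the causes listed above might look like, here is a minimal Python sketch. It is not taken from any particular cluster manager: the function names, the one-second clock tolerance, and the sample mount table (including the `/scratch` Lustre entry) are all assumptions made for the example. On a real Linux node the mount text would typically come from `/proc/mounts`, and the reference time from a trusted time source.

```python
from typing import Callable, Dict, List


def is_mounted(mount_point: str, mounts_text: str) -> bool:
    """Return True if mount_point is a mount target in /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expected layout: device mount_point fs_type options dump pass
        if len(fields) >= 2 and fields[1] == mount_point:
            return True
    return False


def clock_within_tolerance(local_time: float, reference_time: float,
                           max_skew_s: float = 1.0) -> bool:
    """Return True if the local clock is within max_skew_s of the reference."""
    return abs(local_time - reference_time) <= max_skew_s


def run_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every named check and return the names of those that failed.

    A check that raises an exception is counted as failed, so one broken
    check cannot hide the results of the others.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed


# Made-up sample data standing in for the contents of /proc/mounts.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
mgs@o2ib:/scratch /scratch lustre rw 0 0
"""

checks = {
    "scratch_mounted": lambda: is_mounted("/scratch", SAMPLE_MOUNTS),
    "home_mounted": lambda: is_mounted("/home", SAMPLE_MOUNTS),
    "clock_in_sync": lambda: clock_within_tolerance(1000.0, 1000.2),
}
failed = run_checks(checks)
print(failed)  # ['home_mounted']
```

A node that reports any failed check could then be flagged for the administrator or withheld from scheduling. Checks for GPU drivers, InfiniBand adapters, disks, and system services would follow the same pattern but depend on site-specific tooling, so they are left out here.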
Cures that Work
In general, there are three ways to overcome BHNS. The first two, extensive scripting or extending workload managers with custom scripts, are time-consuming and costly. The third alternative is an extensive, automatic health checking protocol that only a few cluster management solutions can provide. Therefore, when reviewing your cluster management needs, ask the following questions:
- Does the cluster management solution have extensive, built-in health-checking capabilities?
- If yes, can these capabilities be customized or extended for business-specific needs?
- How much overhead is associated with health checking?
- How much effort is required to integrate the workload manager into the overall solution?
- Can the workload manager schedule health checks?
- Can the solution improve your ROI?
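To make the value of scheduled health checks concrete, the toy simulation below sketches how a single unhealthy node can absorb most of a queue, and how a pre-dispatch health check changes the outcome. It is a simplified model, not the behavior of any real workload manager: the `simulate` function, the node names, and the runtimes (100 time units for a healthy job, 1 for a job that fails at once on the bad node) are all assumptions chosen for illustration.

```python
import heapq
from typing import Callable, Dict, Optional


def simulate(num_jobs: int, runtimes: Dict[str, int],
             is_healthy: Optional[Callable[[str], bool]] = None) -> Dict[str, int]:
    """Dispatch jobs one at a time to whichever node becomes free first.

    runtimes maps each node to how long a job occupies it. A black hole
    node fails jobs almost immediately, so its runtime is very short.
    If is_healthy is given, a node is checked before each dispatch and is
    taken out of the pool for good when the check fails.
    Returns the number of jobs handed to each node.
    """
    free_at = [(0, node) for node in runtimes]
    heapq.heapify(free_at)
    assigned = {node: 0 for node in runtimes}
    dispatched = 0
    while dispatched < num_jobs and free_at:
        now, node = heapq.heappop(free_at)
        if is_healthy is not None and not is_healthy(node):
            continue  # node is offline; it never re-enters the pool
        assigned[node] += 1
        dispatched += 1
        heapq.heappush(free_at, (now + runtimes[node], node))
    return assigned


# Hypothetical cluster: node03 is the black hole and fails jobs instantly.
RUNTIMES = {"node01": 100, "node02": 100, "node03": 1}

without_check = simulate(20, RUNTIMES)
with_check = simulate(20, RUNTIMES, is_healthy=lambda n: n != "node03")

print(without_check)  # {'node01': 1, 'node02': 1, 'node03': 18}
print(with_check)     # {'node01': 10, 'node02': 10, 'node03': 0}
```

In this model, 18 of the 20 jobs land on the bad node when nothing checks it, while the pre-dispatch check routes every job to a healthy node. The health check here is a stand-in lambda; in practice it would be whatever checks the cluster management solution or site scripts provide.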
Today’s biggest supercomputers use tens of thousands of cores, and tomorrow’s machines, with a projected million or more cores, are just around the corner. Parts failure, seen and unseen, will be an ongoing issue, and cluster monitoring and health management solutions, such as Bright Cluster Manager, will be essential for maintaining productivity within a healthy system.