Some cluster health issues are easier to detect than others. They range from defective devices caught by checks on newly delivered equipment to the hard-to-detect Black Hole Node Syndrome (BHNS). In a recent survey, more than 64% of respondents stated that BHNS has affected their systems and their work.
Black Hole Node Syndrome: Common Causes
In general, node failure in HPC systems is a somewhat common occurrence that normally would not create significant issues or downtime. Dead nodes are easy to detect. Once noticed, administrators simply work around the downed units, omitting them when scheduling jobs.
A bigger problem, and one much harder to spot, is Black Hole Node Syndrome, where nodes seem to be working to spec but are unhealthy in less noticeable, often random, ways. Unhealthy nodes can act as “black holes”, sucking all jobs out of a workload manager queue and leaving users and system administrators wondering where all the jobs went. Because each job fails almost immediately, the node looks free again and the scheduler keeps sending it more work. Causes of BHNS can include a GPU driver that failed to load, an irregular system clock, an unmounted parallel file system, a malfunctioning InfiniBand adapter, errors on the disk drive, system services not running, or a host of other issues. If undetected, these bad nodes can cause recurrent or cascading failures.
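To make the idea concrete, here is a minimal Python sketch of a pre-job node health check covering a few of the causes listed above. It is an illustration only, not how Bright Cluster Manager or any particular workload manager implements health checks. The function names, the `/mnt/pfs` mount point, the `/dev/nvidia0` device path, and the one-second clock tolerance are all assumptions chosen for the example.

```python
import os
import time
from typing import Callable, Dict, List


def filesystem_mounted(path: str) -> bool:
    """Return True if `path` is currently a mount point."""
    return os.path.ismount(path)


def device_present(path: str) -> bool:
    """Return True if a device node (e.g. one created by a GPU driver) exists."""
    return os.path.exists(path)


def clock_in_sync(local_time: float, reference_time: float,
                  max_skew_seconds: float = 1.0) -> bool:
    """Return True if the local clock is within tolerance of a reference clock."""
    return abs(local_time - reference_time) <= max_skew_seconds


def run_health_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of those that failed.

    A check that raises an exception is treated as a failure, so a broken
    check cannot make an unhealthy node look healthy.
    """
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            failed.append(name)
    return failed


if __name__ == "__main__":
    # Hypothetical values: adjust the paths for the actual cluster, and obtain
    # the reference time from a trusted time source. Here the reference is a
    # placeholder equal to the local time, so the clock check always passes.
    now = time.time()
    failures = run_health_checks({
        "parallel_fs_mounted": lambda: filesystem_mounted("/mnt/pfs"),
        "gpu_device_present": lambda: device_present("/dev/nvidia0"),
        "clock_in_sync": lambda: clock_in_sync(now, now),
    })
    if failures:
        print("UNHEALTHY, take node out of scheduling:", ", ".join(failures))
    else:
        print("HEALTHY")
```

Run before each job, or on a timer, a check like this lets a node be taken out of scheduling as soon as any single check fails, rather than after it has drained the queue.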
Today’s biggest supercomputers use tens of thousands of cores, and tomorrow’s machines, with a projected million-plus cores, are just around the corner. Part failures, seen and unseen, will be an ongoing issue, and cluster monitoring and health management solutions such as Bright Cluster Manager will be essential for keeping such systems healthy and productive.