The Black Hole Node Syndrome
The dreaded "black hole node syndrome" silently and randomly kills productivity in HPC clusters. The workload manager reports that all nodes are up, yet jobs sometimes run and sometimes crash, leaving few clues for even the best system administrators to diagnose the problem.
In the worst cases, all the compute jobs are flushed from the queue for no apparent reason. Valuable compute hours are lost, energy is wasted, and other priorities are sidelined; frustrated users hound beleaguered system administrators. Return on investment suffers, both directly through increased operating expense and indirectly through the opportunity cost of downtime and redirected staff time.
Causes of Black Hole Node Syndrome
Most cluster management software can only detect when nodes are "dead," which is straightforward. The harder problem is nodes that are unhealthy in subtle ways. These sick nodes crash jobs, either consistently or, worse, on a seemingly random basis. Examples of node "illnesses" capable of crashing jobs include (a minimal detection sketch follows the list):
- GPU driver failed to load
- Unmounted parallel file system
- Full scratch disk
- Malfunctioning InfiniBand adapter
- Irregular system clock
- SMART errors on the disk drive
- System services not running
- External user authentication not working properly
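To make the first category concrete, here is a minimal sketch of a pre-job "hard" check covering three of the illnesses above. The /lustre and /scratch paths, the 50 GB floor, and the NVIDIA driver are assumptions for illustration, not part of any product:

```python
#!/usr/bin/env python3
"""Minimal pre-job "hard" health check: any failure should keep the
node out of the job. All paths and thresholds are illustrative."""
import os
import shutil
import sys

def gpu_driver_loaded():
    # A loaded NVIDIA driver registers itself under /proc on Linux.
    return os.path.exists("/proc/driver/nvidia/version")

def filesystem_mounted(mount_point):
    # True only if something is actually mounted at mount_point, which
    # catches an unmounted parallel file system.
    return os.path.ismount(mount_point)

def scratch_has_space(path, min_free_gb):
    if not os.path.isdir(path):
        return False
    return shutil.disk_usage(path).free / 1e9 >= min_free_gb

checks = [
    ("GPU driver loaded", gpu_driver_loaded()),
    ("parallel FS mounted", filesystem_mounted("/lustre")),   # assumed mount point
    ("scratch space", scratch_has_space("/scratch", 50)),     # assumed 50 GB floor
]

failed = [name for name, ok in checks if not ok]
if failed:
    print("UNHEALTHY:", ", ".join(failed), file=sys.stderr)
    sys.exit(1)  # nonzero exit = node fails the check
```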
In addition to crashing jobs, there are a number of performance-reducing ailments that most cluster management software overlooks (a companion sketch follows this list):
- Rogue processes present on the node
- Degraded RAID array
- Swap memory in use
- Network interfaces not up
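A companion "soft" check for these performance ailments warns rather than fails, since the conditions degrade jobs instead of killing them. The 100 MB swap tolerance and the ib0 interface name are again assumptions:

```python
#!/usr/bin/env python3
"""Minimal "soft" health check: these ailments degrade performance
rather than crash jobs, so they warn instead of failing the node."""
import sys

def swap_in_use_kb():
    # /proc/meminfo reports SwapTotal and SwapFree in kB on Linux.
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            info[key] = int(rest.split()[0])
    return info["SwapTotal"] - info["SwapFree"]

def interface_up(name):
    # The kernel exposes link state under /sys/class/net/<if>/operstate.
    try:
        with open(f"/sys/class/net/{name}/operstate") as f:
            return f.read().strip() == "up"
    except FileNotFoundError:
        return False

warnings = []
if swap_in_use_kb() > 100_000:   # assumed ~100 MB tolerance
    warnings.append("node is swapping")
if not interface_up("ib0"):      # assumed InfiniBand interface name
    warnings.append("interface ib0 is not up")

for w in warnings:
    print("WARN:", w, file=sys.stderr)  # flag for the admin; do not fail
```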
Unless unhealthy nodes are detected, workload managers will continue to schedule jobs onto them, causing repeated failures.
Black Hole Node Syndrome Prevention
There are three approaches to preventing black hole node syndrome:
- Extensive scripting
- Extending workload managers with custom scripts
- The Bright Answer: extensive automatic health-checking capabilities
There is, of course, a fourth option: do nothing and accept job losses.
- Extensive scripting — Veteran HPC specialists typically work hard to prevent black hole node syndrome by writing a wide array of scripts that perform pre- and post-job health checks, usually developed iteratively after each round of job losses. This approach can solve the problem, but it is costly in time and in the jobs lost while each iteration is worked out.
Less-experienced HPC users face a long learning curve, punctuated by a great deal of frustration. In either case, these scripts and workarounds are seldom documented, leaving the HPC facility at risk when the specialists move on to other roles or leave the organization.
- Workload managers plus scripting — Workload managers can address part of the problem, but custom scripts must still be written to fill the gaps these products leave. This approach potentially reduces the scope of scripting, but comes with similar opportunity costs and organizational risks.
- The Bright Answer — Bright Cluster Manager provides an alternative that saves money, time and skilled people's patience while maximizing system throughput.
How Bright Cluster Manager Prevents the Black Hole Node Syndrome
Bright Cluster Manager, combined with an integrated workload manager, goes beyond other cluster management software's ability to detect dead nodes; Bright also detects unhealthy nodes that crash jobs or impede performance.
Bright Cluster Manager includes a framework that enables system administrators to define pre-job health checks: specific, low-impact tests that run just before a job executes to identify illnesses that can affect it. The system administrator can activate a wide selection of pre-built health checks, or define custom thresholds and actions. Bright instructs the workload manager to hold the job briefly while the nodes reserved for the job, and other system elements, are tested. If any node fails a health check, predefined actions are executed; Bright and the workload manager then dynamically reschedule the job onto a set of healthy nodes while alerting the system administrator. The cluster remains productive, and the unhealthy node is flagged for further attention.
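Bright's framework is proprietary, but the mechanism can be illustrated in generic workload-manager terms. Under Slurm, for example, slurmd runs a site-configured Prolog script on each allocated node before the job launches, and a nonzero exit typically drains the node and requeues a batch job onto other nodes. A hedged sketch of such a wrapper, where /usr/local/sbin/hard_checks.py is a hypothetical stand-in for the checks sketched earlier:

```python
#!/usr/bin/env python3
"""Illustrative prolog-style wrapper (not Bright's implementation).
Run the hard health checks before the job starts; exiting nonzero
signals the workload manager to keep this node out of the job."""
import subprocess
import sys

try:
    result = subprocess.run(
        ["/usr/local/sbin/hard_checks.py"],  # hypothetical check script
        capture_output=True, text=True, timeout=30,
    )
except subprocess.TimeoutExpired:
    # A hung check is itself a symptom; treat it as a failure.
    print("pre-job health check timed out", file=sys.stderr)
    sys.exit(1)

if result.returncode != 0:
    # Surface the reason in the daemon's log, then fail so the
    # scheduler drains this node and reschedules the job elsewhere.
    print(result.stderr.strip(), file=sys.stderr)
    sys.exit(1)
```

The division of labor is the point: the check decides whether the node is healthy, while the workload manager holds, drains, and reschedules; this is the behavior Bright automates and records.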
This framework also provides a clear record of the health checks run and the actions taken against Black Hole Node Syndrome, so that administrators do not waste time deciphering undocumented scripts written by colleagues or predecessors.
Bright Cluster Manager integrates deeply with a wide range of admin-selectable workload managers, including PBS Professional®, SLURM, Torque/Maui, Torque/Moab, Grid Engine, LSF, and OpenLava. Upon selection, Bright automatically configures and deploys the workload manager throughout the cluster, and provides monitoring and management of it through the intuitive Bright Cluster Manager GUI, the powerful cluster management shell, and the web-based user portal. Additional benefits of this seamless integration include sampling, analysis, and visualization of all key workload manager statistics within the Bright GUI; automatic head node failover; access to the Bright SOAP API; and the pre-job health-checking capability.
“The black hole node syndrome is a serious issue. There are many subtle problems from apparently healthy nodes that can create cascading job failures.”
“The bigger the job, the more nodes, the higher the probability of failure.”
“We lose approximately 0.5% of jobs run due to BHNS - netting out to roughly 9,000 job crashes per year.”
“It's not always evident that there is a problem at first. The scheduler continuously assigns new jobs to the unhealthy nodes.”
“We don't realize there may be a problem until we notice that an extremely high rate of job failure has occurred on part of the cluster.”
— Dr. Don Holmgren, Computer Services Architect at Fermilab
“When a job fails, the source of the problem is often difficult to identify. A job that crashes can run fine on another cluster, or even on the same cluster if you run it again.”
“You can spend hours in the data center pulling your hair out, pulling up floor tiles, or worse. Is it the machine? The middleware? The code itself? Or the data?”
“I spent a hundred hours or more over 5 months writing scripts to isolate the problem - a significant time investment, and a lot of crashed jobs.”
— Jesse Trucks, HPC Cyber Security Administrator at ORNL