The Black Hole Node Syndrome

The dreaded "black hole node syndrome" silently and randomly kills productivity in HPC clusters. Although the workload manager reports that all nodes are running, a job sometimes executes and sometimes crashes, leaving even the best system administrators few clues for fixing the problem.

In the worst cases, all the compute jobs are flushed from the queue for no apparent reason. Valuable compute hours are lost, energy is wasted and other priorities are sidelined, while frustrated users hound beleaguered system administrators. Return on investment suffers both directly, through increased operating expense, and indirectly, through the opportunity cost of downtime and redirected manpower.

Causes of Black Hole Node Syndrome

Most cluster management software is only capable of detecting when nodes are "dead." That capability is straightforward. The bigger problem, however, occurs when there are nodes that are unhealthy in a subtle way. These sick nodes will crash jobs, either consistently or, worse, on a seemingly random basis. Examples of node "illnesses" capable of crashing jobs include:

  • GPU driver failed to load
  • Unmounted parallel file system
  • Full scratch disk
  • Malfunctioning InfiniBand adapter
  • Irregular system clock
  • SMART errors on the disk drive
  • System services not running
  • External user authentication not working properly
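
Two of the illnesses above, an unmounted parallel file system and a full scratch disk, can be sketched as simple node-local checks. This is a minimal illustration only, not Bright Cluster Manager's implementation: the function names and the 10% free-space threshold are assumptions chosen for the example, and the mount table is assumed to be in the Linux /proc/mounts format.

```python
import shutil


def mounted_targets(mounts_text):
    """Return the set of mount points listed in /proc/mounts-style text.

    Each line is expected to look like: <device> <mount point> <fstype> ...
    """
    targets = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            targets.add(fields[1])
    return targets


def is_mounted(mount_point, mounts_text):
    """True if mount_point appears as a mount target in the given table."""
    return mount_point in mounted_targets(mounts_text)


def scratch_has_space(path, min_free_fraction=0.10):
    """True if the file system holding path has at least the given free share.

    The 10% default is an illustrative threshold, not a recommendation.
    """
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return False
    return usage.free / usage.total >= min_free_fraction
```

On a Linux node, is_mounted could be fed the contents of /proc/mounts, and scratch_has_space the path of the local scratch directory.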

In addition to ailments that crash jobs, there are a number of performance-reducing ailments that most cluster management software overlooks:

  • GPU driver failed to load
  • Rogue processes present on the node
  • Degraded RAID array
  • Swap memory is being used
  • Network interfaces not up
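
As a rough sketch of how two of these ailments might be detected on a Linux node, the following reads swap usage from /proc/meminfo-style text and interface state from /sys/class/net. The function names are hypothetical and the code is not taken from any cluster management product.

```python
from pathlib import Path


def swap_used_kb(meminfo_text):
    """Return swap in use (kB) computed from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    # Swap in use is the difference between the total and the free amount.
    return values.get("SwapTotal", 0) - values.get("SwapFree", 0)


def interfaces_down(required, sys_net=Path("/sys/class/net")):
    """Return the required interfaces whose operstate is not "up".

    An interface that does not exist is reported as down as well.
    """
    down = []
    for name in required:
        try:
            state = (sys_net / name / "operstate").read_text().strip()
        except OSError:
            state = "missing"
        if state != "up":
            down.append(name)
    return down
```

A check built on these helpers could flag a node when swap_used_kb returns a non-zero value or when interfaces_down returns a non-empty list for the interfaces a job depends on.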

Unless unhealthy nodes are detected, workload managers will continue to include them in jobs, causing continual repeat failures.
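
The effect can be illustrated with a toy model. The sketch below assumes a simple first-come, first-served scheduler that hands the next queued job to whichever node becomes free first, and that failed jobs are not requeued; real workload managers are more sophisticated, and the run and crash times are arbitrary. Because a sick node fails its job quickly, it frees up sooner than its healthy neighbors and keeps pulling in further jobs.

```python
import heapq


def drain_queue(num_jobs, nodes, sick, run_time=10, crash_time=1):
    """Simulate the toy scheduler and return (completed, failed) job counts.

    nodes: node names; sick: set of names whose jobs always crash.
    """
    # Heap of (time at which the node becomes free, node name).
    free_at = [(0, node) for node in nodes]
    heapq.heapify(free_at)
    completed = failed = 0
    for _ in range(num_jobs):
        now, node = heapq.heappop(free_at)
        if node in sick:
            failed += 1
            # The job crashes quickly, so the node is free again almost at once.
            heapq.heappush(free_at, (now + crash_time, node))
        else:
            completed += 1
            heapq.heappush(free_at, (now + run_time, node))
    return completed, failed
```

In this model, a single sick node among four can account for the majority of a 20-job queue, which is the "black hole" behavior described above.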

Black Hole Node Syndrome Prevention

There are three approaches to preventing black hole node syndrome:

  1. Extensive scripting
  2. Extending workload managers with custom scripts
  3. The Bright Answer: extensive automatic health-checking capabilities

There is, of course, a fourth option: do nothing and accept job losses.

  1. Extensive scripting: Veteran HPC specialists typically work hard to prevent black hole node syndrome by writing a wide array of scripts that perform pre- and post-job health checks. These scripts are usually developed iteratively, following job losses. This approach can solve the problem, but it is costly in terms of time and of the jobs lost along the way.

    Less-experienced HPC users face a long learning curve, punctuated by a great deal of frustration. In either case, these scripts and workarounds are seldom documented, leaving the HPC facility at risk when the specialists move on to other roles or leave the organization.
  2. Workload managers plus scripting: Workload managers can address part of the problem, but again, custom scripts must be written to fill the gaps in these products. This approach potentially reduces the scope of scripting, but comes with similar opportunity costs and organizational risks.
  3. The Bright Answer: Bright Cluster Manager provides an alternative that saves money, time and skilled people's patience while maximizing system throughput.

How Bright Cluster Manager Prevents the Black Hole Node Syndrome

Bright Cluster Manager, combined with an integrated workload manager, goes beyond other cluster management software's ability to detect dead nodes; Bright also detects unhealthy nodes that crash jobs or impede performance.

Bright Cluster Manager includes a framework that enables system administrators to define pre-job health checks: specific low-impact tests that are run just before a job is executed to identify illnesses that can affect jobs. The system administrator can activate a wide selection of pre-built health checks, or define custom thresholds and actions. Bright instructs the workload manager to hold the job briefly while the nodes reserved for this job and other system elements are tested. If any node fails the health check, predefined actions are executed. Bright and the workload manager then dynamically reschedule the job to a set of healthy nodes while alerting the system administrator. The cluster remains productive, and the unhealthy node is flagged for further attention.
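
The flow described above (hold the job, test the reserved nodes, act on failures, move the job to healthy nodes) can be sketched in a few lines. This is an illustrative model under stated assumptions, not the Bright Cluster Manager API: select_healthy_nodes, the checks dictionary and the on_failure callback are hypothetical names, and the "spare" nodes stand in for whatever replacement nodes the workload manager can offer.

```python
def select_healthy_nodes(reserved, spares, checks, on_failure):
    """Return as many healthy nodes as were reserved, or None if too few exist.

    reserved:   nodes originally reserved for the job
    spares:     replacement candidates (hypothetical stand-in for rescheduling)
    checks:     dict mapping a check name to a function node -> bool
    on_failure: callback(node, failed_check_names), e.g. alert the administrator
    """
    healthy = []
    for node in list(reserved) + list(spares):
        if len(healthy) == len(reserved):
            break  # enough healthy nodes found; remaining spares stay untested
        failures = [name for name, check in checks.items() if not check(node)]
        if failures:
            # Predefined action: flag the node instead of running the job on it.
            on_failure(node, failures)
        else:
            healthy.append(node)
    return healthy if len(healthy) == len(reserved) else None
```

For example, with a check that fails on node02 and one spare node available, the job would be placed on node01 and node03, and node02 would be passed to on_failure for follow-up. A return value of None would mean the job stays held.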

This framework provides a clear record of the preventive actions taken against Black Hole Node Syndrome, so that other administrators do not waste time interpreting scripts written by colleagues or predecessors.

Bright Cluster Manager deeply integrates a wide range of admin-selectable workload managers, including PBS Professional®, SLURM, Torque/Maui, Torque/Moab, Grid Engine, LSF and OpenLava. Upon selection, Bright automatically configures and deploys the workload manager throughout the cluster, and provides monitoring and management of the workload manager through the intuitive Bright Cluster Manager GUI, its powerful cluster management shell and the web-based user portal. Additional benefits of the seamless integration include sampling, analysis and visualization of all key workload manager statistics within the Bright GUI; automatic head node failover; access to the Bright SOAP API; and the pre-job health checking capability.

 
 
Site Map | Legal | © 2009–2013 Bright Computing, Inc. All rights reserved.