Bright Health
When a cluster is used intensively, its performance and availability can decline over time.
Bright Cluster Manager™ includes Bright Health — a set of powerful tools that help you keep your cluster healthy.
The toolset was developed over many years and is based on our extensive experience in managing and using large and complex clusters.
Why You Need Bright Health
An HPC cluster is not a static system — it evolves over time as hardware and software are subject to change, especially when used intensively:
- A cluster contains many hardware parts that are subject to wear and tear, in particular moving parts such as fans and disks.
- Filesystems can get corrupted.
- User applications can spawn processes that are not killed after the job has finished.
- Software packages can go out of date and become incompatible with newer versions of other packages.
- Users and system administrators may leave old software in places where it affects the system.
Key Features & Packages
The "Bright Health" toolset keeps your cluster healthy by thoroughly and frequently checking many health indicators and taking action where necessary. Some key features of Bright Health include:
- Continuous monitoring of the cluster's health.
- Autonomous bypass repair of faulty nodes.
- Many hardware performance and consistency checks.
- Automatic job requeuing, avoiding queue flushing.
- Process "jailing" in order to allocate, track & trace user processes.
Bright Health consists of four packages:
- Cluster Sweeper
- PreJob Checker
- CPUset Manager
- MPI Integrator
Bright Health Cluster Sweeper
The Bright Health Cluster Sweeper checks the health of hardware and software at regular intervals.
When a node is not used by a user, several tests are performed, for example:
- Memory performance tests
- CPU performance test
- Hardware consistency check compared to hardware group template
- Filesystem mount-point check
- User account access tests
- MPI performance tests between sets of nodes
- InfiniBand and Ethernet tests
- SSH connection test
- Unauthorized user process removal
- User defined tests
When a node is found to have a hardware fault, it is automatically marked offline in the cluster's workload manager, and the administrator is notified as required.
The MPI test runs between pairs of nodes, checking the InfiniBand or Ethernet network conditions.
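The hardware consistency check listed above can be pictured as a comparison between what a node reports and what its hardware group template expects. The following is a minimal sketch under assumptions, not the Cluster Sweeper's actual implementation: the function name `hardware_mismatches` and the field names (`cpu_cores`, `memory_gb`, `disks`) are hypothetical and chosen only for illustration.

```python
def hardware_mismatches(template, detected):
    """Return {field: (expected, found)} for every field that differs.

    `template` and `detected` are plain dicts; the field names used in the
    example below are hypothetical, not a real Bright Health schema.
    """
    mismatches = {}
    for field, expected in template.items():
        found = detected.get(field)  # None if the node did not report it
        if found != expected:
            mismatches[field] = (expected, found)
    return mismatches


# Hypothetical template for a hardware group, and one node's report.
template = {"cpu_cores": 16, "memory_gb": 64, "disks": 2}
node_report = {"cpu_cores": 16, "memory_gb": 48, "disks": 2}

print(hardware_mismatches(template, node_report))
```

A non-empty result (here, less memory than the template expects) would be the kind of finding that leads to a node being marked offline.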
An important feature of the Cluster Sweeper is that it "cleans" the cluster of rogue user processes.
Rogue processes are processes that user jobs sometimes leave behind after a job has finished.
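One way to think about rogue process detection is to compare the owners of the processes on a node with the users who currently have a job running there. The sketch below illustrates that idea only. It assumes the process list and the set of active users are already available, and the names `Process`, `find_rogue_processes`, and `SYSTEM_USERS` are hypothetical rather than part of Bright Health.

```python
from collections import namedtuple

# Hypothetical record for one process on a node.
Process = namedtuple("Process", ["pid", "user"])

# Assumed set of system accounts that must never be treated as rogue.
SYSTEM_USERS = frozenset({"root", "daemon"})


def find_rogue_processes(processes, users_with_jobs, system_users=SYSTEM_USERS):
    """Return processes owned by ordinary users who have no job on this node."""
    rogue = []
    for proc in processes:
        if proc.user in system_users:
            continue  # system services are out of scope
        if proc.user not in users_with_jobs:
            rogue.append(proc)  # owner has no active job here
    return rogue


processes = [Process(1, "root"), Process(4120, "alice"), Process(4377, "bob")]
users_with_jobs = {"alice"}  # only alice has a job on this node

print(find_rogue_processes(processes, users_with_jobs))
```

In this example only bob's process is reported, because bob has no job on the node. What happens next (for instance, terminating the process) is left out of the sketch.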
Bright Health PreJob Checker
The Bright Health PreJob Checker consists of a number of tests that are run before a job starts. The tests usually take no more than a few seconds and are similar to those in the Cluster Sweeper. If a test fails, the faulty node is taken offline in the workload manager, the administrator is notified, and the job is re-queued. The advantage of the PreJob Checker is that a single faulty node no longer flushes the workload manager's queue empty. On large clusters this is an essential tool.
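The control flow described here (run the tests, and on the first failure take the node offline, notify the administrator, and re-queue the job) can be sketched as follows. This is an illustration under assumptions, not the PreJob Checker's code: the function `run_prejob_checks` and its callback parameters are hypothetical stand-ins for whatever the workload manager integration actually provides.

```python
def run_prejob_checks(node, job, checks, take_offline, notify_admin, requeue):
    """Run each (name, check) pair on `node`; return True if the job may start.

    On the first failing check the node is taken offline, the administrator
    is notified, and the job is re-queued. All three actions are passed in
    as callbacks because the real interfaces are not specified here.
    """
    for name, check in checks:
        if not check(node):
            take_offline(node)
            notify_admin(f"{node}: pre-job check '{name}' failed")
            requeue(job)
            return False
    return True


# Example with stub checks and callbacks that only record what happened.
events = []
checks = [
    ("ssh", lambda node: True),
    ("mounts", lambda node: node != "node042"),  # pretend node042 lost a mount
]

ok = run_prejob_checks(
    "node042",
    "job-17",
    checks,
    take_offline=lambda n: events.append(("offline", n)),
    notify_admin=lambda msg: events.append(("notify", msg)),
    requeue=lambda j: events.append(("requeue", j)),
)
print(ok, events)
```

Because the job is re-queued rather than failed, it can later start on a healthy node while the faulty one stays offline.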
Bright Health CPUset Manager
Bright Health CPUset Manager allows for "jailing" and allocating processes to certain CPU cores. Even the threads or "children" created by those processes are "jailed" to the same core. This is useful when multiple users run on the same large multi-core node (such as a quad-socket or eight-socket server, or a ScaleMP or NumaScale solution).
Processes belonging to the same user job are migrated to different CPU cores depending on the chosen logic. Some applications run more efficiently if their processes are allocated to cores belonging to the same CPU; for other applications the exact opposite is true. Bright Health CPUset Manager handles the allocation and migration of processes, as well as the complexity of deciding what to do when multiple users share the same machine.
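The two opposite preferences mentioned above, keeping a job's processes on one CPU versus spreading them across CPUs, can be illustrated with a small core-selection sketch. It is an assumption-laden example, not the CPUset Manager's logic: the function `place_processes` and the policy names `compact` and `scatter` are hypothetical, and the free cores are assumed to be supplied as one list per CPU socket.

```python
def place_processes(n_procs, sockets, policy="compact"):
    """Pick `n_procs` core ids from `sockets` (one list of free cores per CPU).

    "compact" fills one socket before moving to the next; "scatter" takes
    cores from the sockets in round-robin order.
    """
    if policy == "compact":
        order = [core for socket in sockets for core in socket]
    elif policy == "scatter":
        order = []
        longest = max((len(s) for s in sockets), default=0)
        for i in range(longest):
            for socket in sockets:
                if i < len(socket):
                    order.append(socket[i])
    else:
        raise ValueError(f"unknown policy: {policy!r}")
    if n_procs > len(order):
        raise ValueError("not enough free cores")
    return order[:n_procs]


# A hypothetical dual-socket node with four free cores per socket.
sockets = [[0, 1, 2, 3], [4, 5, 6, 7]]
print(place_processes(4, sockets, "compact"))  # [0, 1, 2, 3]
print(place_processes(4, sockets, "scatter"))  # [0, 4, 1, 5]
```

The sketch only chooses cores. Actually confining a process to them is a separate step; on Linux, for example, Python exposes `os.sched_setaffinity(pid, cores)` for setting a process's CPU affinity.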
Bright Health MPI Integrator
MPI typically spawns its processes using the SSH (Secure Shell) mechanism, thus not informing the workload manager. The Bright Health MPI Integrator tracks and traces user processes. For example, if a user issues a qdel command, processes will automatically be cleaned up after the job has finished.
Conclusion
All these Bright Health packages are part of Bright Cluster Manager and are designed to deliver the highest job throughput, the best overall cluster efficiency, and the lowest administrative overhead.