How to Quickly Burn Test Your HPC Cluster Using the Bright Cluster Manager CMGUI

The Bright Cluster Manager burn framework automatically runs a set of test scripts on one or more HPC cluster nodes. This lets you verify that the hardware is functioning and that the cluster is fully operational, for example before you sign off on the installation. The burn framework is installed by default on all Bright clusters.

Bright provides two ways to use the burn framework: through the Cluster Manager Shell (CMSH) or through the Bright CMGUI. This article describes how to burn test nodes using the CMGUI. Burn tests using the CMSH will be covered in a separate posting.

Getting started:

In the resource tree, select the node you want to burn test, then click its "Burn" tab and press the "Start new burn" button.

The "Select burn configuration" dialog box is displayed. Choose either the "default" or the "long-hpl" burn configuration from the "Burn configuration" select list. In this walkthrough we run the default burn test. The burn configuration can also be edited from this dialog, for example to remove a phase. Press the "Ok" button to continue.

Bright automatically reboots the node and starts the burn test.

To monitor progress, select the "Nodes" resource from the resource tree, then the "Burn Overview" tab. In our example, the overview shows that phase2 of the burn test is running. Bright runs the phases of a burn test in series and the tests within a phase in parallel. At this point, the disktest, mce_check and kmon tests are running.
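
The series/parallel ordering described above can be illustrated with a small Python sketch. This is not Bright's implementation, only a toy model of the scheduling idea: phases run one after another, and the tests inside a phase run at the same time. The phase1 contents, the `fake_test` stand-in and the `run_burn` helper are all hypothetical; only the phase2 test names are taken from this article.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical layout: only the phase2 test names come from the article.
PHASES = [
    ("phase1", ["placeholder_test"]),
    ("phase2", ["disktest", "mce_check", "kmon"]),
]

events = []                      # (phase, test, "start" | "end") in observed order
events_lock = threading.Lock()


def fake_test(phase, test):
    """Stand-in for a real test script; it only records when it starts and ends."""
    with events_lock:
        events.append((phase, test, "start"))
    time.sleep(0.05)             # pretend to do some work
    with events_lock:
        events.append((phase, test, "end"))
    return True                  # pretend the test passed


def run_burn(phases):
    """Run phases one after another, and the tests within each phase concurrently."""
    results = {}
    for phase, tests in phases:
        # Leaving the with-block waits for every test in this phase to finish,
        # so the next phase cannot start early.
        with ThreadPoolExecutor(max_workers=len(tests)) as pool:
            futures = {test: pool.submit(fake_test, phase, test) for test in tests}
        results[phase] = {test: future.result() for test, future in futures.items()}
    return results


if __name__ == "__main__":
    for phase, outcome in run_burn(PHASES).items():
        print(phase, outcome)
```

In this model, every phase1 event is recorded before any phase2 event, while the start and end events of the three phase2 tests may interleave in any order.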

The final phase in the default burn configuration runs memtest86 indefinitely, so the burn test does not end on its own. To end it, press the "Cancel burn" button on the node's "Burn" tab. The node then reboots automatically and is returned to service.