The HPC industry is marching towards the delivery of the first exascale systems, composed of tens of thousands or hundreds of thousands of server nodes and exotic software stacks to manage them. While the first systems will have a limited audience, exascale systems will increasingly move down-market to reach large numbers of enterprises. As exascale systems make their way into commercial enterprises, the need for more standardized, out-of-the-box management tools will follow.

As systems scale to exceptionally large numbers of servers, new limitations begin to surface that impact the system’s performance and how far the system can effectively scale. In particular, monitoring the health and metrics of the cluster’s many servers taxes the system’s head node to the point where performance peaks prematurely.  

Traditionally, the head nodes of a cluster are responsible for monitoring all compute nodes. That works well for clusters of up to 20,000 to 30,000 nodes. For larger clusters, it makes sense to have dedicated monitoring nodes. Bright Cluster Manager 9.1 now allows you to add as many monitoring nodes as you like. Monitoring nodes can be set up to monitor specific groups of nodes in the cluster. And, when a particular monitoring node fails, other monitoring nodes can take over automatically.

In Bright Cluster Manager 9.1, the cluster monitoring function for the entire system can therefore be “offloaded” from the system’s head node to a set of dedicated servers that perform system monitoring exclusively, freeing the head node to perform its other duties and allowing the system to continue scaling. If a dedicated monitoring node fails, the remaining monitoring nodes take over monitoring the orphaned compute nodes until the failed monitoring node is reinstated.
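
To make the takeover behavior concrete, here is a minimal Python sketch of the general idea. It is not Bright Cluster Manager's implementation or API. The function names (assign_nodes, fail_over), the round-robin assignment, and the least-loaded redistribution policy are all assumptions chosen for illustration.

```python
def assign_nodes(compute_nodes, monitors):
    """Spread compute nodes round-robin across the monitoring nodes.

    Hypothetical policy for illustration; the real product may group
    nodes differently (for example by rack or node category).
    """
    if not monitors:
        raise ValueError("at least one monitoring node is required")
    assignment = {m: [] for m in monitors}
    for i, node in enumerate(compute_nodes):
        assignment[monitors[i % len(monitors)]].append(node)
    return assignment


def fail_over(assignment, failed):
    """Return a new assignment with the failed monitor's nodes reassigned.

    Each orphaned compute node goes to whichever surviving monitoring
    node currently watches the fewest compute nodes (an assumed policy).
    """
    survivors = [m for m in assignment if m != failed]
    if not survivors:
        raise RuntimeError("no monitoring node left to take over")
    new_assignment = {m: list(assignment[m]) for m in survivors}
    for node in assignment.get(failed, []):
        target = min(survivors, key=lambda m: len(new_assignment[m]))
        new_assignment[target].append(node)
    return new_assignment


if __name__ == "__main__":
    compute = [f"node{i:03d}" for i in range(9)]
    monitors = ["mon1", "mon2", "mon3"]
    before = assign_nodes(compute, monitors)
    after = fail_over(before, "mon2")
    for monitor, nodes in after.items():
        print(monitor, nodes)
```

In this sketch no compute node is left unmonitored after a failure, and the extra load is spread across the surviving monitoring nodes rather than falling back on the head node. Reinstating the failed node would simply mean running the assignment again with the full list of monitoring nodes.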

As a result, Bright Cluster Manager can now build and manage clusters of up to 100,000 nodes, depending on how many dedicated monitoring nodes are used. In addition, the dedicated monitoring nodes automatically back up monitoring data from the system, ensuring that monitoring data isn’t lost.

