Cluster Manager Advice for Sysadmins: Software Monitoring Means Better Cluster Management


By Brady Black | September 02, 2014 | HPC cluster management, HPC




Our two previous posts offering advice to sysadmins were mostly about HPC systems hardware monitoring. In case you missed them, you can get caught up by reading part 1 and part 2. If cluster management can be likened to a four-legged stool, the first three legs would be your node processors, memory, and network monitoring. At some point, however, sysadmins need to pay close attention to the fourth leg: software monitoring.

HPC systems guru Jeff Layton writes in

“As a starting point, two primary things you should be monitoring are the resource manager (job scheduler) and the tools (software packages/versions) clients are using.”

It all ties in with overall cluster resource management, which is vital if you want to get the most out of your system. Staying on top of the jobs streaming throughout the cluster can reveal a great deal:

  •    how long any specific job ran
  •    who on the system ran the job
  •    when the job was requested
  •    when the job started
  •    how long the job sat in the queue
  •    how many jobs are in the queue at one time
  •    in the case of multiple queues, which queues typically have the most jobs waiting
  •    the most popular times of day and days during the week when jobs are running

Knowing all that information not only gives the sysadmin a deeper understanding of what the cluster is doing, but also provides the analytics and insight required to improve system performance.
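To make the list above concrete, here is a minimal Python sketch of how a few of those figures (queue wait, run time, average wait per queue, and busiest start hours) could be derived from job records. The record layout, field names, and sample values are hypothetical and chosen only for illustration; a real resource manager exposes its accounting data in its own format, so treat this as a starting point rather than a finished tool.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical job records; the field names and values are illustrative
# and do not correspond to any particular scheduler's accounting format.
JOBS = [
    {"user": "alice", "queue": "short", "submit": "2014-09-01 09:00",
     "start": "2014-09-01 09:05", "end": "2014-09-01 09:35"},
    {"user": "bob", "queue": "long", "submit": "2014-09-01 09:10",
     "start": "2014-09-01 10:10", "end": "2014-09-01 14:10"},
    {"user": "alice", "queue": "long", "submit": "2014-09-01 13:00",
     "start": "2014-09-01 14:30", "end": "2014-09-01 15:00"},
]

TIME_FORMAT = "%Y-%m-%d %H:%M"


def minutes_between(earlier, later):
    """Return the number of minutes between two timestamp strings."""
    delta = datetime.strptime(later, TIME_FORMAT) - datetime.strptime(earlier, TIME_FORMAT)
    return delta.total_seconds() / 60


def summarize(jobs):
    """Compute per-job wait/run times, average wait per queue, and start-hour counts."""
    per_job = []
    waits_by_queue = defaultdict(list)
    start_hours = Counter()
    for job in jobs:
        wait = minutes_between(job["submit"], job["start"])  # time spent in the queue
        run = minutes_between(job["start"], job["end"])      # time spent running
        per_job.append({"user": job["user"], "queue": job["queue"],
                        "wait_min": wait, "run_min": run})
        waits_by_queue[job["queue"]].append(wait)
        start_hours[datetime.strptime(job["start"], TIME_FORMAT).hour] += 1
    avg_wait = {q: sum(w) / len(w) for q, w in waits_by_queue.items()}
    return per_job, avg_wait, start_hours


if __name__ == "__main__":
    per_job, avg_wait, start_hours = summarize(JOBS)
    for queue, wait in sorted(avg_wait.items()):
        print(f"{queue}: average wait {wait:.1f} min")
```

The same loop could be extended to group by user or by day of the week; the point is simply that once submit, start, and end times are recorded, every item in the list falls out of basic arithmetic.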

A powerful aspect of clusters is the ability to run multiple operating systems and to keep different versions of the same software available side by side. This allows tweaking and experimentation to optimize cluster management.

Ensuring that all the different versions of software and different cross-compiled configurations are available to the appropriate users is key to streamlining the management of the cluster. This is where HPC system environment modules come into play. The module system allows users to select, individually or as a group, their preferred codes, compilers, and libraries.

Before you can analyze software usage data, you have to collect it and store it somewhere. Here again, cluster management software can help by providing the data along with tools to analyze and visualize it later. That avoids the hassle of tracking usage by means of a wrapper script and shunting the data into a database for later querying.
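For a sense of what the do-it-yourself route involves, here is a rough Python sketch of the storage half of the wrapper-script approach: each module load is written to a small SQLite database, which can then be queried for usage counts. The table name, column names, and function names are assumptions made up for this example, and the step that actually intercepts the module command is deliberately left out because it depends on how modules are set up at your site.

```python
import sqlite3
from datetime import datetime, timezone


def open_store(path=":memory:"):
    """Open (or create) the usage database. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS module_loads ("
        "loaded_at TEXT NOT NULL, username TEXT NOT NULL, module TEXT NOT NULL)"
    )
    return conn


def record_load(conn, username, module, when=None):
    """Store one module-load event; a wrapper script would call this."""
    when = when or datetime.now(timezone.utc)
    conn.execute(
        "INSERT INTO module_loads (loaded_at, username, module) VALUES (?, ?, ?)",
        (when.isoformat(), username, module),
    )
    conn.commit()


def load_counts(conn):
    """Return (module, count) pairs, most frequently loaded first."""
    cursor = conn.execute(
        "SELECT module, COUNT(*) AS n FROM module_loads "
        "GROUP BY module ORDER BY n DESC, module"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    conn = open_store()
    # Sample events; the users and module names are invented for illustration.
    record_load(conn, "alice", "gcc/4.8.2")
    record_load(conn, "bob", "gcc/4.8.2")
    record_load(conn, "alice", "openmpi/1.8.1")
    for module, count in load_counts(conn):
        print(f"{module}: {count} loads")
```

Even this small version needs a schema, a writer, and a query layer that someone has to maintain, which is the overhead that built-in collection and visualization tools are meant to remove.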

Read all about Harvard University’s approach and the module loading commands that have been successful for Jeff Layton in the piece “Gathering Data on Environment Modules.”

Whatever the complexity of your cluster environment, Bright Computing has the cluster manager solution to help you harness the power of big data in a variety of configurations and environments. Contact us and see why we lead the pack in cluster support.
