Monitoring HPC Clusters: Is It Time for an Extreme Makeover?


By Ian Lumb | November 13, 2014 | HPC Cluster, Linux Cluster Management, Management, Monitoring



From the components that comprise a node (e.g., CPU cores plus accelerators and coprocessors, memory, disks), to the operating system and HPC software stack (e.g., client system for executing workload), device-specific metrics abound. As individual nodes communicate via interconnect fabrics (e.g., Ethernet, InfiniBand), metrics mushroom to reveal cluster-wide perspectives.

Everyone involved in running HPC clusters needs metrics. Hybrid-architecture distributed systems, however, are a challenge to monitor, especially as they scale out. Given the importance and inherent difficulties, it is not surprising that monitoring is once again topical in the HPC community. In fact, recent discussions call for an extreme makeover, expressing the need to modernize monitoring for HPC clusters.

I am delighted to see monitoring getting the attention it rightly deserves. Because there are different approaches to modernizing monitoring for HPC clusters, I recently shared my perspective over at insideHPC. In the article, I discuss the different schools of thought when it comes to modernizing monitoring for HPC. By aligning yourself with a particular approach at the outset, you are making future commitments regarding the outcomes that are achievable with your monitoring system. The article concludes:

“If your needs are exclusively for monitoring your HPC environment, a meta-toolkit [a toolkit about toolkits] may suffice. However, if you seek a more comprehensive and future-proofed solution for monitoring that also includes provisioning and management capabilities, you need a unified solution that has these integrated capabilities architected in from the outset.”

I encourage you to read the article in its entirety at insideHPC so that you can make informed decisions regarding your options for monitoring. Of course, members of the Bright Computing team are ready to engage in detailed discussions regarding your monitoring needs for your HPC environment.

A modernized monitoring system is a critical service for HPC clusters of any size. In fact, monitoring factors into more than one of the 5 essential strategies for successful HPC clusters. Please click here to read the insideHPC whitepaper that details these success strategies.
