From the components that make up a node (e.g., CPU cores plus accelerators and coprocessors, memory, disks) to the operating system and HPC software stack (e.g., the client system for executing workloads), device-specific metrics abound. As individual nodes communicate via interconnect fabrics (e.g., Ethernet, InfiniBand), metrics mushroom to reveal cluster-wide perspectives.
Everyone involved in running HPC clusters needs metrics. Hybrid-architecture, distributed systems, however, are a challenge to monitor, especially as systems scale out. Given this importance and these inherent difficulties, it is not surprising that monitoring is once again topical in the HPC community. In fact, recent discussions call for an extreme makeover, expressing the need to modernize monitoring for HPC clusters.