In a previous blog we briefly discussed how sysadmins need to focus on the hardware side of the HPC cluster equation. Clustering, of course, is where businesses can approach the power of the “super computer,” and the nodes are the “brain cells” that need to be monitored and nurtured.
What clusters do best is run software applications in a parallel, resource-sharing way, where the combined power of the whole can be diminished by a few misallocated parts. Those applications produce the data analytics and output that run the business, and that output can come to a sudden halt if network traffic and memory usage aren’t continually monitored.
It’s no good telling the CEO that your cluster’s operating system and applications broke down, and you lost productive time trying to reconstruct that batch of accounts receivable invoices. The CEO is liable to ask some embarrassingly probing questions. Your answers will undoubtedly have to go far past that old standby excuse, “Linux ate my RAM.”
Then there’s the double-edged sword of monitoring your network usage. Network management can come down to anticipating bottlenecks in application performance, which, in turn, makes monitoring all the more difficult. Get a handle on network usage, though, and you can start directing traffic through your cluster and avoid the network equivalent of rush-hour gridlock.
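One hedged way to get that handle, sketched below: sample the kernel’s interface counters twice and diff them to estimate per-interface throughput. On a real node you would read `/proc/net/dev`; the sample strings here stand in for those reads, and the interface name and numbers are invented for illustration.

```python
def parse_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev text."""
    stats = {}
    for line in text.splitlines()[2:]:          # first two lines are headers
        iface, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is bytes received; field 8 is bytes transmitted.
        stats[iface.strip()] = (int(fields[0]), int(fields[8]))
    return stats

def throughput(before, after, seconds):
    """Bytes/sec received and sent per interface between two snapshots."""
    return {
        iface: ((after[iface][0] - rx) / seconds,
                (after[iface][1] - tx) / seconds)
        for iface, (rx, tx) in before.items() if iface in after
    }

# Two hypothetical snapshots taken 10 seconds apart.
sample_t0 = """Inter-|   Receive                  |  Transmit
 face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
  eth0: 1000000 500 0 0 0 0 0 0 2000000 400 0 0 0 0 0 0
"""
sample_t1 = """Inter-|   Receive                  |  Transmit
 face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
  eth0: 1600000 800 0 0 0 0 0 0 2300000 600 0 0 0 0 0 0
"""

rates = throughput(parse_net_dev(sample_t0), parse_net_dev(sample_t1), 10)
print(rates["eth0"])   # (60000.0, 30000.0) bytes/sec in and out
```

Run the same diff across every node’s interfaces and the busiest links, your candidate gridlock spots, stand out immediately.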
Gridlock and system slowdown can occur as users task the cluster and share application resources. Measuring the latter, depending on just what you want to measure, can be problematic.
Should you measure what each person on the system is using, or the overall memory usage on each node? It’s not exactly mixing apples and oranges, but the composite of analytics can be somewhat like a fruit salad.
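The “what each person is using” half of that question can be approximated, with caveats, by summing resident set sizes (RSS) per user from `ps` output on each node. A minimal sketch, assuming `ps -eo user,rss` output; the sample string below stands in for the real command output, and the usernames and sizes are invented:

```python
from collections import defaultdict

def rss_by_user(ps_output):
    """Return {user: total RSS in KiB} from `ps -eo user,rss` output."""
    totals = defaultdict(int)
    for line in ps_output.splitlines()[1:]:     # skip the header row
        user, rss = line.split()
        totals[user] += int(rss)
    return dict(totals)

# Hypothetical output of `ps -eo user,rss` on one node.
sample = """USER       RSS
root      1024
alice    20480
alice    10240
bob       4096
"""

print(rss_by_user(sample))   # per-user totals in KiB for this node
```

The caveat is the fruit-salad problem: RSS double-counts shared libraries across processes, so the per-user slices won’t add up to the node’s true usage.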
Whichever question you try to answer is complicated because:
- The Linux operating system runs its memory gathering independently of application demands. (You can read more about that in this Virtual Threads article.)
- Determining how much memory running applications are using is also difficult. The closest you can come is a back-door trick: determine how much memory is free and do some subtracting.
HPC Admin guru Jeff Layton puts it this way:
“At the very least, you’ll be able to measure how much memory is being used minus the buffers and caches, which will include all user applications, root applications, and shared libraries. It might not be exactly what you want, but getting something more detailed or granular requires a great deal more work.”
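The subtraction Layton describes can be sketched from `/proc/meminfo`: take the total, then strip off free memory, buffers, and the page cache, leaving roughly what applications and shared libraries hold. The sample string below stands in for a real read of `/proc/meminfo`, with invented sizes:

```python
def parse_meminfo(text):
    """Return {field: KiB} from /proc/meminfo-style text."""
    info = {}
    for line in text.splitlines():
        key, value = line.split(":")
        info[key.strip()] = int(value.split()[0])   # values are listed in kB
    return info

def used_minus_caches(info):
    """Memory held by applications, excluding buffers and page cache."""
    return info["MemTotal"] - info["MemFree"] - info["Buffers"] - info["Cached"]

# Hypothetical /proc/meminfo excerpt from a 16 GB node.
sample = """MemTotal:       16384000 kB
MemFree:         4096000 kB
Buffers:          512000 kB
Cached:          2048000 kB
"""

print(used_minus_caches(parse_meminfo(sample)))   # 9728000 (kB)
```

As the quote warns, this lumps user applications, root applications, and shared libraries together; anything more granular takes a great deal more work.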
On the other hand, the right cluster manager can get you much closer to that granular view.