A sysadmin's most important task in administering an HPC cluster begins with the hardware. Because the idea behind a computer cluster is parallel operation and redundancy through connected nodes, knowing whether a node is up, down or lagging heads the list.
It may sound simple, but usually it is not. Is the node unresponsive to pings? Or if it does respond, why can't you log into it? What if you can log into it, but it still cannot run an application or job? The bottom line has to be that if the node cannot do the work -- run the job or application -- the node is "down," because your cluster is not fully "up."
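The escalating questions above (does it ping, can you log in, can it run work) can be sketched as a staged check in which a node counts as "up" only if every stage passes. This is a minimal illustration, not a production health check: it assumes Linux `ping` flags, key-based SSH access, and a hypothetical smoke-test script at `/usr/local/bin/node_smoke_test` on each node. The function names are also illustrative.

```python
import subprocess


def run_ok(cmd, timeout=15):
    """Return True if the command exits with status 0 within the timeout."""
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def pings(host):
    # Assumes Linux iputils ping: one packet, two-second wait.
    return run_ok(["ping", "-c", "1", "-W", "2", host])


def accepts_login(host):
    # BatchMode makes ssh fail instead of prompting for a password.
    return run_ok(["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "true"])


def runs_job(host):
    # Hypothetical script path: substitute a short job that exercises your stack.
    return run_ok(["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5",
                   host, "/usr/local/bin/node_smoke_test"])


DEFAULT_CHECKS = [("ping", pings), ("login", accepts_login), ("job", runs_job)]


def node_state(host, checks=DEFAULT_CHECKS):
    """Run checks in order and stop at the first failure.

    Returns ("up", None) if every check passes, otherwise
    ("down", name_of_failed_check).
    """
    for name, check in checks:
        if not check(host):
            return ("down", name)
    return ("up", None)
```

Calling `node_state("node001")` would return either `("up", None)` or `("down", <stage>)`, where the stage name tells you how far the node got before it failed. That is usually the first thing you want to know when troubleshooting.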
The inability to run a job on a node encompasses a complicated set of possible system glitches:
Capturing all of the above in a single metric, or with just one tool, isn’t easy either. Here are two possible troubleshooting approaches:
After determining whether a node is up or down, turn to monitoring resource use. Clusters comprise a set of common resources -- the system's processors, memory, local hard disks, the network and your central storage.
Intelligent cluster design is really nothing more than taking applications that have a known drain on system resources and putting together the right combination of cluster resources to do the job. Monitor your resource usage to know whether you are applying the right resource combination to carry the application load.
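As a rough illustration of per-node resource monitoring, the sketch below takes a snapshot of processor, memory and local disk use with the Python standard library and flags anything above a chosen limit. It assumes a Linux node (it reads `/proc/meminfo` and relies on the `MemAvailable` field), and the function names and limits are illustrative. Network and central storage are left out because measuring them depends on the site.

```python
import os
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number}.

    Most fields are reported in kB; lines without a numeric value are skipped.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def snapshot(disk_path="/"):
    """Return the fraction in use (0.0 to 1.0, roughly) for each resource."""
    with open("/proc/meminfo") as fh:
        mem = parse_meminfo(fh.read())
    disk = shutil.disk_usage(disk_path)
    load1, _, _ = os.getloadavg()  # one-minute load average
    return {
        # Load per core; this can exceed 1.0 on an oversubscribed node.
        "cpu": load1 / (os.cpu_count() or 1),
        "memory": 1 - mem["MemAvailable"] / mem["MemTotal"],
        "disk": disk.used / disk.total,
    }


def over_limit(usage, limits):
    """Return the sorted names of resources whose usage exceeds its limit."""
    return sorted(
        name for name, value in usage.items()
        if value > limits.get(name, 1.0)
    )
```

For example, `over_limit(snapshot(), {"cpu": 0.9, "memory": 0.9, "disk": 0.9})` would list the resources running above 90 percent on the local node. Collected across nodes and over time, numbers like these show whether the resource mix matches the application load.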