The sysadmin's most important task in administering an HPC cluster begins with the hardware. Because a computer cluster depends on parallel operation and redundancy across connected nodes, knowing whether each node is up, down or lagging heads the list.
It may sound simple, but usually it is not. Is the node unresponsive to pings? Or if it does respond, why can't you log into it? What if you can log into it, but it still cannot run an application or job? The bottom line has to be that if the node cannot do the work -- run the job or application -- the node is "down," because your cluster is not fully "up."
The inability to run a job on a node encompasses a complicated set of possible system glitches:
- network connectivity
- storage availability
- user authentication
- resource allocation
Capturing all of the above in a single metric, or with just one tool, isn't easy either. Here are two possible troubleshooting approaches:
- Ping a node or run a simple script on a node (or have the master node ssh, for example with a user name -r) to see if the node is “alive.” If the script runs successfully, the node may be alive but not necessarily well. For the "unwell" node, the same technique can isolate slow-running nodes that take an excessively long time to complete the command.
- Create a short job for the node to handle between user jobs. The diagnostic can be more than just a simple piece of code. It could be a housekeeping task you need to do before launching the next big user job.
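As a rough illustration of the first approach, here is a minimal Python sketch that pings a node, runs a trivial command on it over ssh, and times the result. The function names, the `slow_after_s` threshold and the status labels are hypothetical, and the `ping` flags assume a Linux system with iputils; treat it as a starting point under those assumptions rather than a finished health checker.

```python
import subprocess
import time


def ping_ok(host, timeout_s=2):
    """Return True if a single ICMP echo succeeds (assumes Linux iputils ping flags)."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", "-W", str(timeout_s), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 1,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def ssh_probe(host, command="true", timeout_s=10):
    """Run a trivial command over ssh; return (succeeded, elapsed_seconds)."""
    start = time.monotonic()
    try:
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes",
             "-o", f"ConnectTimeout={timeout_s}", host, command],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        ok = result.returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        ok = False
    return ok, time.monotonic() - start


def classify_node(host, ping=ping_ok, probe=ssh_probe, slow_after_s=3.0):
    """Return (status, reason), where status is 'down', 'slow' or 'up'."""
    if not ping(host):
        return "down", "no ping reply"
    ok, elapsed = probe(host)
    if not ok:
        # The node answers pings but cannot do the work, so count it as down.
        return "down", "ssh command failed or timed out"
    if elapsed > slow_after_s:
        return "slow", f"command took {elapsed:.1f}s"
    return "up", "responded normally"
```

Calling `classify_node("node042")` (a made-up hostname) from the master node would return a status and a short reason. Passing the ping and probe functions as parameters keeps the decision logic separate from the network calls, so it can be exercised without touching a real node.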
After determining whether a node is up or down, turn to monitoring resource use. A cluster comprises a set of common resources: the system's processors, memory, local hard disks, the network and your central storage.
Intelligent cluster design is really nothing more than knowing how heavily your applications draw on system resources and putting together the right combination of cluster resources to do the job. Monitor your resource usage to know whether you are applying the right combination to carry the application load.
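To make the idea of monitoring resource use a little more concrete, the following Python sketch takes a one-off snapshot of processor load, memory and local disk use on a single node, using only the standard library. It assumes a Linux node that exposes `/proc/meminfo`; the function names and the returned keys are hypothetical, and network and central storage use are left out because they depend on your particular setup.

```python
import os
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value} dict (values mostly in kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def resource_snapshot(disk_path="/", meminfo_path="/proc/meminfo"):
    """Return load, memory and disk use for the local node as simple ratios."""
    load_1min, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1
    disk = shutil.disk_usage(disk_path)
    with open(meminfo_path) as f:
        mem = parse_meminfo(f.read())
    total = mem.get("MemTotal", 0)
    available = mem.get("MemAvailable", 0)
    return {
        # Above 1.0 means more runnable work than processors to run it.
        "load_per_cpu": load_1min / cpus,
        "mem_used_fraction": 1 - available / total if total else None,
        "disk_used_fraction": disk.used / disk.total,
    }
```

Run periodically on each node, or as part of the short diagnostic job between user jobs, a snapshot like this gives a first indication of whether an application is short of processors, memory or local disk.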