Cluster Management Automation
Cluster Management Automation lets cluster administrators set a threshold for any metric and define an action to be taken when that threshold is crossed. Any built-in or custom metric supported by Bright Cluster Manager® can be used, and any cluster management shell command, Linux command or script can serve as the action.
Examples of Actions
Some examples of "actions" that can be configured with Bright Cluster Manager® include:
Examples of Rules
Some examples of "rules" that can be configured with Bright Cluster Manager include:
- If the amount of free space in /home falls below 9.3 gigabytes, send an email to administrator@localhost.
- If the number of running jobs exceeds 120, log an event in the GUI event viewer.
- If the temperature of any node in the node category "Large SMP Nodes" exceeds 60 degrees Celsius, send an SMS text message to mobile phone number +1 123 123 1234 and shut down the offending node.
Automating responses in this way can be a real time-saver. For example, you can monitor the health of your cluster and take preemptive action when hardware shows signs of imminent failure, or you can monitor cluster usage and act before the cluster runs out of resources.
A configuration wizard is available to guide you through the steps of defining a rule, which includes selecting a metric, defining a threshold and defining an action.
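To make the three parts of a rule concrete, here is a minimal Python sketch of a rule as a metric, a threshold and an action. It is an illustration only and does not use the Bright Cluster Manager API: the `Rule` class, the `evaluate` function and the metric names `home_free_gb` and `running_jobs` are all hypothetical, and the actions simply append to a list instead of sending email or logging events.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """Hypothetical rule: a metric name, a comparison, a threshold and an action."""
    metric: str
    compare: Callable[[float, float], bool]
    threshold: float
    action: Callable[[str, float], None]


def evaluate(rules: List[Rule], samples: Dict[str, float]) -> List[str]:
    """Run the action of every rule whose threshold is crossed.

    Returns the metric names of the rules that fired.
    """
    fired = []
    for rule in rules:
        value = samples.get(rule.metric)
        if value is not None and rule.compare(value, rule.threshold):
            rule.action(rule.metric, value)
            fired.append(rule.metric)
    return fired


# Stand-in for real actions (email, event log): record what would happen.
log: List[str] = []

rules = [
    # "If free space in /home falls below 9.3 gigabytes, send an email."
    Rule("home_free_gb", operator.lt, 9.3,
         lambda m, v: log.append(f"email: {m}={v}")),
    # "If the number of running jobs exceeds 120, log an event."
    Rule("running_jobs", operator.gt, 120,
         lambda m, v: log.append(f"event: {m}={v}")),
]

fired = evaluate(rules, {"home_free_gb": 8.0, "running_jobs": 100})
```

Passing the comparison as a function lets the same structure express both "falls below" and "exceeds" rules.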
Cluster Management Automation is sophisticated and highly configurable. One example is its ability to deal with so-called "state flapping", a situation in which a threshold is crossed repeatedly within a short time frame. This can happen, for example, when a CPU temperature fluctuates around a configured threshold, potentially causing the system to send out many emails in a short time. The system can detect such a situation, and you can configure exactly how it should be handled.
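One common way to handle state flapping is to count how often the state changes within a sliding time window and to suppress actions while that count is too high. The Python sketch below illustrates that general idea under stated assumptions. It is not a description of how Bright Cluster Manager detects flapping, and the `FlapGuard` class and its parameters (`window_s`, `max_crossings`) are hypothetical.

```python
from collections import deque


class FlapGuard:
    """Hypothetical flap suppressor for a single 'value > threshold' rule."""

    def __init__(self, threshold: float, window_s: float, max_crossings: int):
        self.threshold = threshold
        self.window_s = window_s            # length of the sliding window, in seconds
        self.max_crossings = max_crossings  # state changes tolerated per window
        self.exceeded = False               # last known state
        self.crossings = deque()            # timestamps of recent state changes

    def update(self, t: float, value: float) -> bool:
        """Feed one sample; return True if the action should fire now."""
        now_exceeded = value > self.threshold
        if now_exceeded == self.exceeded:
            return False                    # no state change, nothing to do
        self.exceeded = now_exceeded
        self.crossings.append(t)
        # Forget state changes that fell out of the sliding window.
        while self.crossings and t - self.crossings[0] > self.window_s:
            self.crossings.popleft()
        flapping = len(self.crossings) > self.max_crossings
        # Fire only on a rise above the threshold, and only when not flapping.
        return now_exceeded and not flapping


# A CPU temperature hovering around 60 degrees Celsius, as (seconds, value) pairs.
guard = FlapGuard(threshold=60.0, window_s=300, max_crossings=2)
samples = [(0, 59), (10, 61), (20, 59), (30, 61), (400, 59), (410, 62)]
results = [guard.update(t, v) for t, v in samples]
```

In this run the first rise above 60 degrees fires the action, the second rise 20 seconds later is suppressed as flapping, and the rise at 410 seconds fires again because the earlier state changes have left the window.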
“The centralized status and health information database simplifies troubleshooting and reduces service disturbances.”
— Erik Engquist, Systems Administrator at the University of Houston