GPU Cluster Management & Monitoring

Bright Cluster Manager includes powerful GPU management and monitoring capabilities that leverage functionality in NVIDIA® Tesla™ GPUs to take maximum control of the GPUs and gain insight into their status and activity over time. Bright also includes the necessary CUDA and OpenCL libraries.

GPU Monitoring

Visualizing one or more GPU metrics in a graph is very easy with Bright Cluster Manager.

Bright Cluster Manager can sample and monitor metrics from supported GPUs and GPU Computing Systems, such as the Kepler-architecture NVIDIA Tesla K80 dual-GPU accelerator as well as collections of GPU accelerators in a single chassis. 

Examples of supported metrics include:

  • GPU temperatures
  • GPU exclusivity modes
  • GPU fan speeds
  • system fan speeds
  • PSU voltages and currents
  • system LED states
  • GPU ECC statistics (Fermi GPUs only)
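
To give a feel for what sampling a few of these metrics involves, here is a minimal sketch that reads per-GPU values with the `nvidia-smi` command-line tool. It is not how Bright Cluster Manager collects its metrics, which this page does not describe. The field selection and the function names `parse_gpu_metrics` and `sample_gpu_metrics` are assumptions made for illustration.

```python
import csv
import io
import subprocess

# Query fields accepted by `nvidia-smi --query-gpu`; this selection is illustrative.
FIELDS = ["index", "temperature.gpu", "fan.speed", "compute_mode"]


def parse_gpu_metrics(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    samples = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        sample = dict(zip(FIELDS, (cell.strip() for cell in row)))
        for key in ("index", "temperature.gpu", "fan.speed"):
            # Values such as "[N/A]" (e.g. fan speed on passively cooled
            # boards) cannot be converted, so they are stored as None.
            try:
                sample[key] = int(sample.get(key))
            except (TypeError, ValueError):
                sample[key] = None
        samples.append(sample)
    return samples


def sample_gpu_metrics():
    """Run nvidia-smi once and return one dict per GPU (needs an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_metrics(result.stdout)
```

Keeping the parser separate from the `subprocess` call means the parsing logic can be exercised on captured output, without a GPU present.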

The frequency of metric sampling is fully configurable, as is the consolidation of these metrics over time. Metrics are stored in Bright Cluster Manager's monitoring database; they can be visualized in value/time graphs, as well as in Bright Cluster Manager's unique Rackview.
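
Consolidation over time can be pictured as averaging raw samples into coarser time buckets. The sketch below illustrates that general idea only. The `consolidate` function, its `interval` parameter, and the sample readings are hypothetical and do not reflect Bright Cluster Manager's actual consolidation settings.

```python
from collections import defaultdict
from statistics import mean


def consolidate(samples, interval):
    """Average (timestamp, value) samples into fixed-width time buckets.

    Returns a list of (bucket_start, average_value) pairs sorted by time.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = (timestamp // interval) * interval
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Hypothetical GPU temperature readings taken every 30 seconds.
raw = [(0, 60), (30, 62), (60, 70), (90, 74), (120, 65)]
per_minute = consolidate(raw, interval=60)
```

Here the two readings in the first minute collapse into a single averaged point, which is how long-term storage stays small while short-term data keeps its full resolution.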

GPU Management

Bright Cluster Manager can trigger alerts and actions automatically when GPU metric thresholds are exceeded. These rules are fully configurable, and any built-in cluster management command, Linux command, or shell script can be used as an action.

For example, if you would like to automatically receive an email and shut down a GPU node when its GPU temperature exceeds a set value, this can easily be configured in Bright Cluster Manager.
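
The logic of such a rule can be sketched in a few lines. This is an illustration of the threshold-and-action pattern, not Bright Cluster Manager's configuration syntax. The `ThresholdRule` class, the metric name, the 85-degree threshold, and the two stand-in actions are all hypothetical. A real setup would send mail and power off the node where this sketch only appends to a list.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ThresholdRule:
    """Hypothetical rule: run every action when a metric exceeds its threshold."""
    metric: str
    threshold: float
    actions: List[Callable[[str, float], None]] = field(default_factory=list)

    def evaluate(self, node: str, value: float) -> bool:
        """Run the actions if value exceeds the threshold; return whether it did."""
        if value <= self.threshold:
            return False
        for action in self.actions:
            action(node, value)
        return True


log = []  # stands in for real side effects such as mail delivery or power control


def send_email(node, value):
    log.append(f"email: {node} reached {value} C")


def shut_down(node, value):
    log.append(f"shutdown: {node}")


rule = ThresholdRule("gpu_temperature", threshold=85.0,
                     actions=[send_email, shut_down])
rule.evaluate("node001", 72.0)  # below the threshold, no action runs
rule.evaluate("node001", 91.0)  # above the threshold, both actions run
```

Modelling actions as plain callables mirrors the idea that any command or script can serve as an action.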

Cluster Health Management for GPU Clusters

Cluster Health Management can also include health checks for GPU cards and GPU Computing Systems in GPU clusters. Any of the supported GPU metrics can be used in regular and prejob health checks.

For example, you could configure a prejob health check called "AllFansRunning" and define an appropriate action when the health check has status FAIL.
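
As a rough sketch of what a check like "AllFansRunning" might compute, the code below returns PASS or FAIL from a set of fan readings and calls a failure handler before a job starts. How Bright Cluster Manager expects a health check to report its status is not covered here. The function names, the `minimum` parameter, and the decision to treat missing readings as failures are assumptions.

```python
def all_fans_running(fan_speeds, minimum=1):
    """Return "PASS" when every fan reports at least `minimum`, else "FAIL".

    `fan_speeds` maps a fan name to its reading. A reading of None means
    no value was available and is counted as a failure.
    """
    if not fan_speeds:
        return "FAIL"  # no data at all is treated as a failed check
    for speed in fan_speeds.values():
        if speed is None or speed < minimum:
            return "FAIL"
    return "PASS"


def prejob_check(fan_speeds, on_fail):
    """Run the check before a job; call `on_fail` on FAIL.

    Returns True if the job may start, False otherwise.
    """
    status = all_fans_running(fan_speeds)
    if status == "FAIL":
        on_fail(status)
    return status == "PASS"
```

A caller could pass a handler as `on_fail` that drains the node or notifies an administrator, so that the job is not started on hardware with a stopped fan.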