
NVIDIA GPU Management & Monitoring


Bright Cluster Manager® includes powerful GPU management and monitoring capabilities that use functionality built into NVIDIA® Tesla™ GPUs to take maximum control of the GPUs and to gain insight into their status and activity over time. Bright Cluster Manager also includes the necessary CUDA and OpenCL libraries.

GPU Monitoring

Bright Cluster Manager can sample and monitor metrics from supported GPUs and GPU Computing Systems, such as the Kepler-architecture NVIDIA Tesla K40 GPU accelerator as well as collections of GPU accelerators in a single chassis (e.g., GPU Units such as the Dell PowerEdge C410x PCIe Expansion Chassis).

Examples of supported metrics include:

  • GPU temperatures;
  • GPU exclusivity modes;
  • GPU fan speeds;
  • system fan speeds;
  • PSU voltages and currents;
  • system LED states;
  • GPU ECC statistics (Fermi GPUs only).

See the table below for a complete overview of supported metrics and GPUs.

The frequency of metric sampling is fully configurable and so is the consolidation of the metrics data over time. Metrics data is stored in Bright Cluster Manager's central SQL database and can be visualized in value/time graphs, as well as in Bright Cluster Manager's unique Rackview.
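To make the idea of consolidation concrete, the following is a minimal Python sketch that averages raw samples into fixed-width time windows. It does not use Bright Cluster Manager's API or reflect its actual storage format. The function name `consolidate`, the window size, and the sample values are all hypothetical.

```python
from statistics import mean


def consolidate(samples, window_seconds):
    """Average (timestamp, value) samples into fixed-width time windows.

    Returns a list of (window_start, average) tuples sorted by window_start.
    """
    buckets = {}
    for timestamp, value in samples:
        # Align each sample to the start of the window it falls into.
        window_start = timestamp - (timestamp % window_seconds)
        buckets.setdefault(window_start, []).append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Hypothetical GPU temperature samples (seconds, degrees Celsius),
# taken every 30 seconds and consolidated to one value per minute.
raw = [(0, 60.0), (30, 62.0), (60, 70.0), (90, 74.0), (120, 80.0)]
per_minute = consolidate(raw, window_seconds=60)
print(per_minute)  # [(0, 61.0), (60, 72.0), (120, 80.0)]
```

Averaging is only one possible consolidation function. Keeping the minimum or maximum per window is a common alternative when peaks matter more than trends.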


[Screenshots: Bright Cluster Manager]

GPU Management

Bright Cluster Manager allows for alerts and actions to be triggered automatically when GPU metric thresholds are exceeded. Such rules are completely configurable to suit your requirements, and any built-in cluster management command, Linux command, or shell script can be used as an action.

For example, if you would like to automatically receive an email and shut down a GPU node when its GPU temperature exceeds a set value, this can easily be configured in Bright Cluster Manager.
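The logic of such a rule can be sketched in a few lines of Python. This is an illustration only, not Bright Cluster Manager's configuration syntax or API. The `ThresholdRule` class, the metric name `gpu_temperature`, the 85-degree threshold, and the two placeholder actions are assumptions made for the example. The actions only record what they would do.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ThresholdRule:
    """Hypothetical rule: run every action when a metric exceeds a threshold."""
    metric: str
    threshold: float
    actions: List[Callable[[str, float], None]]

    def evaluate(self, node: str, metrics: Dict[str, float]) -> bool:
        """Return True if the rule fired for this node, False otherwise."""
        value = metrics.get(self.metric)
        if value is None or value <= self.threshold:
            return False
        for action in self.actions:
            action(node, value)
        return True


# Placeholder actions: a real action would send mail or power off the node.
log: List[str] = []


def send_email(node: str, value: float) -> None:
    log.append(f"email: {node} gpu_temperature={value}")


def shut_down(node: str, value: float) -> None:
    log.append(f"shutdown: {node}")


rule = ThresholdRule("gpu_temperature", 85.0, [send_email, shut_down])
rule.evaluate("node001", {"gpu_temperature": 91.0})  # exceeds threshold: fires
rule.evaluate("node002", {"gpu_temperature": 70.0})  # below threshold: no action
print(log)
```

Passing the actions as plain callables mirrors the idea that any command or script can serve as an action. The rule itself only decides when to run them.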

Cluster Health Management for GPU Clusters

Cluster Health Management can also include health checks for GPU cards and GPU Computing Systems in GPU clusters. Any of the supported GPU metrics can be used in regular and prejob health checks.

For example, you could configure a prejob health check called "AllFansRunning" and define an appropriate action for when the health check has status FAIL. In the Rackview screenshot, this indicator has status FAIL for GPU Unit 41.
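A check like "AllFansRunning" could be sketched as follows. This is a minimal Python illustration and not the way Bright Cluster Manager implements or registers health checks. The function names, the fan readings, and the `PASS` status string are assumptions; only the `FAIL` status and the check name come from the text above.

```python
def all_fans_running(fan_speeds_rpm, minimum_rpm=1):
    """Return "PASS" if every fan reports at least minimum_rpm, else "FAIL".

    An empty list of readings is treated as "FAIL", because the check
    cannot confirm that any fan is running.
    """
    if not fan_speeds_rpm:
        return "FAIL"
    if all(rpm >= minimum_rpm for rpm in fan_speeds_rpm):
        return "PASS"
    return "FAIL"


def run_prejob_checks(unit, checks, on_fail):
    """Run each named check; call on_fail(unit, name) for every failure.

    Returns True only if no check failed, so a scheduler could use the
    result to decide whether the job may start on this unit.
    """
    failed = [name for name, check in checks.items() if check() == "FAIL"]
    for name in failed:
        on_fail(unit, name)
    return not failed


# Hypothetical readings: one of four fans in the unit has stopped.
readings = [4200, 4150, 0, 4300]
failures = []
ok = run_prejob_checks(
    "GPU Unit 41",
    {"AllFansRunning": lambda: all_fans_running(readings)},
    on_fail=lambda unit, name: failures.append((unit, name)),
)
print(ok, failures)  # False [('GPU Unit 41', 'AllFansRunning')]
```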

Supported NVIDIA GPU Cards and Computing Systems

The table below shows which metrics are available on various NVIDIA, NextIO and Dell GPU cards and units*:

GPU Metric | M2050 / M2070 / M2090* | S2050 | NextIO vCORE Express | NextIO vCORE Express Extreme | Dell C410x
GPU temperatures | - | | | |
GPU compute mode | | | | |
GPU display mode | n/a | n/a | n/a | n/a | n/a
GPU persistence mode | | | | |
GPU exclusivity mode | | | | |
GPU fan speeds | n/a | n/a | n/a | n/a | n/a
Unit fan speeds | n/a | | | |
Unit/Device serial | | | | |
Unit firmware version | - | | | |
Unit temperature | n/a | | | |
Unit power usage | n/a | | | |
Voltages and currents | n/a | | | |
Unit LED states | n/a | | | |
GPU ECC statistics | | | | |
Board serial number | | | | |
GPU utilization | | | | |
Driver version | - | - | - | - | -
Memory utilization | | | | |
PCI information | | | | |
* This matrix only gives an indication of availability of features. Actual availability may vary depending on time, model number, vendor, OEM, etc.
 

Read more about the NVIDIA Tesla GPU models on the NVIDIA website.
Read more about the NextIO GPU models on the NextIO website.
Read more about the Dell GPU models on the Dell website.

 
 



 

Bright Cluster Manager can leverage the management and monitoring capabilities of supported NVIDIA® Tesla™ GPU cards and rack-mounted GPU Computing Systems.


 

© 2009–2014 Bright Computing, Inc. All rights reserved.