To extract the most value from your HPC cluster, you need to ensure that system resources are being properly utilized. HPC users are notorious for over-requesting resources for their jobs, leaving resources idle or underutilized that could otherwise be doing work for other jobs. Some users hoard resources to make sure they have what they need, but another common reason is that users simply don’t know what resources their jobs need to finish in a specified time. To ensure that precious and expensive cluster resources aren’t being squandered, administrators need actionable details about how those resources are being used. More specifically, they need to know things such as which jobs are using which resources, which jobs aren’t using the resources they’ve been allocated, and which users are repeatedly hoarding resources unnecessarily.
Bright Cluster Manager has a feature called Workload Accounting and Reporting (WAR) that gives cluster administrators the insight they need to answer these types of questions.
This graph shows that jobs run by user David are wasting about 42% of the allocated CPUs, while jobs run by Charlie are wasting about 30%. Perhaps more importantly, none of the jobs are using the GPUs effectively, and user Bob is wasting significantly more GPU capacity than anyone else. A little investigation using Bright’s integrated job monitoring reveals the problem: Bob’s Convolutional MNIST jobs are running, but they aren’t doing anything. They are occupying a GPU without actually using it. The matrix below tells the story. On the left is one of Bob’s Convolutional MNIST jobs that ran properly; on the right is a more recent job that did not use the GPU.
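To make a "wasted percentage" figure like the ones above concrete, here is a minimal Python sketch of how such a number could be derived from per-job accounting records. It is an illustration only, not Bright's implementation or API: the `JobRecord` type, its field names, and the sample values are all hypothetical, and it assumes waste is defined as allocated CPU time minus CPU time actually used, summed per user.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical accounting record for one finished job (not a Bright data type)."""
    user: str
    allocated_cpu_seconds: float  # CPU time reserved for the job
    used_cpu_seconds: float       # CPU time the job actually consumed


def wasted_percent_by_user(jobs):
    """Return {user: percent of allocated CPU time that went unused}."""
    allocated = defaultdict(float)
    used = defaultdict(float)
    for job in jobs:
        allocated[job.user] += job.allocated_cpu_seconds
        used[job.user] += job.used_cpu_seconds

    result = {}
    for user, alloc in allocated.items():
        if alloc <= 0:
            continue  # nothing allocated, so no meaningful percentage
        # Clamp at zero in case reported usage slightly exceeds the allocation.
        wasted = max(alloc - used[user], 0.0)
        result[user] = 100.0 * wasted / alloc
    return result


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        JobRecord("user_a", allocated_cpu_seconds=28800, used_cpu_seconds=14400),
        JobRecord("user_b", allocated_cpu_seconds=7200, used_cpu_seconds=7200),
    ]
    for user, pct in sorted(wasted_percent_by_user(sample).items()):
        print(f"{user}: {pct:.1f}% of allocated CPU time unused")
```

The same aggregation could be applied to GPU time by swapping in GPU-seconds fields; aggregating per user rather than per job keeps a single short job from dominating the result.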
Administrators can use this information to educate users about the resource requirements of their jobs, produce chargeback reports if desired, and give management visibility into how the system is being used, along with a rationale for investing in additional resources when warranted. Even if users are not actually charged for the resources they use, this visibility shows how much the jobs being run actually cost, and stakeholders can use that information to make value judgments. Accounting and reporting views are also available to end users of a cluster through the user portal interface, which gives users insight into how their jobs are utilizing the resources they request.