Bright Goes to WAR with Workload Accounting and Reporting

    

Data center managers are continually under pressure to deliver more results from their existing computing resources. One strategy that has proven successful is the aggregation of what have traditionally been computing silos into a modern, efficient shared compute cluster.

While even a little resource sharing is good, the greatest rewards go to those who understand how jobs use resources (e.g., RAM, CPU) as they run. Organizations can extract more value from the cluster by requiring users to specify accurate resource requirements when they submit jobs. Through its end-user portal, Bright Workload Accounting and Reporting gives users the information they need to size their requests accurately, which allows more jobs to run simultaneously and results in higher throughput.

Getting stakeholders to hand over control of their hardware to another group (usually a centralized IT group) is by no means assured unless the directive comes from the executive level. The best strategy may be to work with all the stakeholders simultaneously, actively selling the advantages of a shared clustered infrastructure to each. You will also need to convince them that they will get their fair share of the resources, i.e., a share in proportion to the amount of hardware they are contributing, with the added bonus of being able to run far more jobs overall. Finally, if possible, you should provide each group with an accounting report that shows its resource usage for the period.
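To make the "fair share" idea concrete, here is a minimal Python sketch that turns each group's hardware contribution into a proportional share of the cluster. The group names, node counts, and the `fair_shares` function are hypothetical and only illustrate the arithmetic; this is not how Bright or any particular workload manager computes shares.

```python
def fair_shares(contributed_nodes):
    """Return each group's share of the cluster as a fraction of all contributed nodes."""
    total = sum(contributed_nodes.values())
    if total <= 0:
        raise ValueError("at least one node must be contributed")
    return {group: nodes / total for group, nodes in contributed_nodes.items()}


# Hypothetical contributions, counted in compute nodes per group.
shares = fair_shares({"engineering": 40, "research": 40, "finance": 20})
for group, share in shares.items():
    print(f"{group}: {share:.0%} of the shared cluster")
```

A real deployment would likely weight nodes by capability (cores, memory, GPUs) rather than counting them equally.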

There are two types of accounting reports: “showback” and “chargeback.” A showback report is a chart and/or table that shows each group's resource usage in terms of a set of metrics chosen by the organization. A chargeback report goes one step further. It assigns a unit cost to each metric, and each group is charged for its actual resource usage. Bright Cluster Manager Workload Accounting and Reporting allows you to generate both types of reports.
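The difference between the two report types can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the metric names, the unit costs, the usage figures, and the `chargeback` function are hypothetical and are not taken from the product.

```python
# Hypothetical unit costs, in currency units per unit of each metric.
UNIT_COSTS = {"cpu_core_hours": 0.05, "gpu_hours": 1.50, "ram_gb_hours": 0.01}

# Hypothetical showback data: raw usage per group for one reporting period.
USAGE = {
    "engineering": {"cpu_core_hours": 12000, "gpu_hours": 300, "ram_gb_hours": 48000},
    "research": {"cpu_core_hours": 8000, "gpu_hours": 900, "ram_gb_hours": 20000},
}


def chargeback(usage, unit_costs):
    """Convert per-group usage into a per-group charge by pricing each metric."""
    return {
        group: sum(amount * unit_costs[metric] for metric, amount in metrics.items())
        for group, metrics in usage.items()
    }


charges = chargeback(USAGE, UNIT_COSTS)
for group, cost in charges.items():
    print(f"{group}: {cost:.2f}")
```

The `USAGE` table on its own corresponds to a showback report; applying `UNIT_COSTS` to it is what turns it into a chargeback report.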

Bright Workload Accounting and Reporting (WAR) is a feature of Bright Cluster Manager 8.2. It combines the metrics that are automatically sampled on all cluster nodes with job metrics collected from the cgroup each job runs within, and produces reports that answer the questions that are important to your business. The questions are expressed as Prometheus Query Language (PromQL) queries. A set of predefined queries is available out of the box, and customers can add as many of their own queries as necessary.

It’s not uncommon for users to game the system by submitting jobs that hold valuable resources (e.g., GPUs) until they are ready to use them. This practice causes a lot of disruption, and left unchecked, tempts other users to follow suit, exacerbating the problem. Bright Workload Accounting and Reporting can easily identify jobs that are wasting resources so you can nip the problem in the bud. Naturally, when fewer resources are wasted there are more resources available for production work.
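As a rough illustration of what "identifying jobs that are wasting resources" can mean, the Python sketch below flags jobs that hold GPUs while barely using them. The `JobSample` record, its field names, and the thresholds are all hypothetical; the product derives this kind of information from its own sampled metrics and queries, not from this code.

```python
from dataclasses import dataclass


@dataclass
class JobSample:
    job_id: str
    user: str
    gpus_allocated: int
    avg_gpu_utilization: float  # fraction between 0.0 and 1.0 over the job's runtime
    runtime_hours: float


def wasted_gpu_hours(job):
    """GPU-hours that were allocated to the job but left idle."""
    return job.gpus_allocated * job.runtime_hours * (1.0 - job.avg_gpu_utilization)


def find_wasteful_jobs(jobs, max_utilization=0.10, min_runtime_hours=1.0):
    """Return GPU jobs that ran long enough to matter yet stayed at or below the utilization threshold."""
    return [
        job
        for job in jobs
        if job.gpus_allocated > 0
        and job.runtime_hours >= min_runtime_hours
        and job.avg_gpu_utilization <= max_utilization
    ]


# Hypothetical job records for one reporting period.
jobs = [
    JobSample("1001", "alice", gpus_allocated=4, avg_gpu_utilization=0.85, runtime_hours=6.0),
    JobSample("1002", "bob", gpus_allocated=8, avg_gpu_utilization=0.02, runtime_hours=12.0),
    JobSample("1003", "carol", gpus_allocated=0, avg_gpu_utilization=0.0, runtime_hours=3.0),
]

for job in find_wasteful_jobs(jobs):
    print(f"job {job.job_id} ({job.user}): {wasted_gpu_hours(job):.1f} GPU-hours idle")
```

The thresholds are a policy choice: a short job with low utilization may simply be starting up, which is why the sketch ignores jobs below a minimum runtime.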

Suppose the contributing groups are using the shared cluster, the cluster is tuned and running at maximum throughput, and expensive resources are being used productively. If jobs are still languishing in the queues for an unacceptably long time, Bright can produce reports that show why they are pending. Are there insufficient job slots, or are you short of the software application licenses required to run the pending jobs? If there are not enough job slots, it’s time to consider adding compute nodes. If there are not enough licenses, it’s time to buy additional seats.

Bright Workload Accounting and Reporting provides data center managers with the knowledge they need to convince stakeholders to contribute their hardware to a shared compute cluster, to prove that the contributing groups are receiving the resources they were promised, to maximize the throughput and efficiency of the shared cluster, and to make the case for new hardware or software resources when needed.