Everyone in the computer business knows how hot the machine learning (ML) space is today. The promises made for AI, and the demands placed on data scientists by their companies, are numerous. Most of these people view the GPU-laden computers they run their analyses on as just a tool: each data scientist is often provided with a powerful machine of their own, and they don't want to be burdened with operationalizing it. Simply put, they just want to run their jobs as quickly as possible.
Most companies want to provide their data scientists with powerful compute resources. As these machines scale in performance, however, especially those incorporating the latest GPUs, they become extremely costly to procure and operate. Meanwhile, IT organizations dealing with large-scale AI compute are often buried managing these unique ML resources and keeping up with the evolving technology. Being a good steward of the company's budget thus puts IT in the challenging position of staying current with that technology while also maximizing utilization of the company's compute resources.
Is there a life preserver that can be thrown to both the data scientists and the IT administrators? I think so, and it may be surprising: the characteristics of ML workloads, with their extreme compute requirements, closely resemble those of high-performance computing (HPC). HPC users have had decades of experience fine-tuning and tweaking their powerful compute environments to drive efficiency and manageability across a broad range of workloads. What HPC capabilities could ML environments also leverage? Cluster management software, for one.
Here are a few HPC capabilities that Bright software for Data Science offers to liberate both IT and the AI data scientist:
- Flexibility: Bright supports multiple OS distributions, architectures, and frameworks. Workloads can run on bare metal or in Singularity containers, scheduled through the HPC workload management system; on physical compute nodes or in virtual machines; and on premises or in the cloud. Administrators can also choose to spawn the familiar Jupyter Notebooks through the HPC workload scheduler or through Kubernetes.
- Cost effective: Because Bright integrates with all the major HPC workload schedulers, as well as Singularity and Kubernetes, user jobs run efficiently on shared clusters, driving higher utilization of expensive resources such as GPUs. Sharing resources is also much more cost effective than purchasing expensive workstations for individual researchers. You may also find that building models across multiple less expensive nodes, each with fewer GPUs, provides a smoother transition as models become more sophisticated and require more GPUs than fit into a single node.
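As a rough sketch of what scheduler-shared GPU usage looks like in practice, here is a hypothetical batch script, assuming Slurm as the workload scheduler and Singularity as the container runtime; the job name, image file, and training script are illustrative, not taken from Bright's documentation:

```shell
#!/bin/bash
# Hypothetical Slurm batch script: request one GPU from the shared
# cluster and run training inside a Singularity container, so nodes
# are time-shared rather than dedicated to one researcher.
#SBATCH --job-name=train-model     # illustrative job name
#SBATCH --gres=gpu:1               # ask the scheduler for a single GPU
#SBATCH --time=04:00:00            # bounded runtime keeps the queue moving
#SBATCH --output=%x-%j.out         # log file named jobname-jobid.out

# --nv maps the host's NVIDIA driver and GPU devices into the container.
singularity exec --nv tensorflow.sif python train.py --epochs 10
```

Submitted with `sbatch train.sbatch`, the job waits in the queue until a GPU is free, which is how a shared cluster keeps expensive accelerators busy instead of idle on someone's desk.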
- Turnkey: Bright software is easy to install and can be deployed in minutes across the entire infrastructure from bare metal. Regular monitoring and health checks of all nodes and critical applications, such as workload schedulers and Kubernetes, show how resources are being used and assure users that their jobs will complete.
- Extensible: Using EasyBuild, administrators or individual data scientists can add new versions of any framework, and using Bright doesn't preclude also using NVIDIA NGC images. These images can be added to the Docker registry and run efficiently through the workload management system in Singularity containers.
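Both extension paths can be sketched from the command line. This assumes network access to the NGC registry (`nvcr.io`) and a working EasyBuild install; the image tag and easyconfig name are illustrative placeholders, not current release versions:

```shell
# Pull an NVIDIA NGC container image and convert it to a Singularity
# image file (SIF). The tag is illustrative; check NGC for current releases.
singularity pull tensorflow.sif docker://nvcr.io/nvidia/tensorflow:23.05-tf2-py3

# Sanity-check the image on a GPU node; --nv exposes the host GPUs.
singularity exec --nv tensorflow.sif nvidia-smi

# Alternatively, build a framework from source with EasyBuild;
# --robot resolves the dependency chain automatically.
eb TensorFlow-2.13.0-foss-2023a.eb --robot
```

The container route gives you NVIDIA's pre-tuned builds immediately, while the EasyBuild route produces a site-local module tree that can be tailored to the cluster's hardware.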
- Fully supported: Whichever supported hardware/OS combination you choose to run, Bright has you covered.