Welcome to NVIDIA Bright Cluster Manager 9.2, the latest version of our market-leading software for building HPC Linux clusters from edge to core to cloud. Release 9.2 delivers even more cluster management capabilities that eliminate complexity and enable flexibility by combining provisioning, monitoring, and management in a single tool that spans the entire lifecycle of your Linux cluster. With Bright Cluster Manager, administrators can provide better support to end users and to the business. Release 9.2 takes simplified cluster management to a whole new level. Here are some of the key features:
Encryption of data at rest is an important component of a comprehensive security strategy. It adds a layer of defense in depth that protects your data if the physical device it resides on falls into the wrong hands: without access to the encryption keys, the data cannot be read or decrypted. Encryption also acts as a security checkpoint: because the encryption keys are centrally managed in Bright, access to data is enforced and can be audited from within Bright.
Disk encryption is one of the mechanisms Bright 9.2 provides to secure your data. You can encrypt the disks of head nodes and compute nodes, including on-premises, edge, and cloud nodes. The head node disk can be encrypted during head node installation, while the disks of all other types of cluster nodes can be encrypted during cluster installation or at any time thereafter.
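Bright automates disk encryption for you, but the underlying mechanism on Linux is typically LUKS/dm-crypt. As a rough illustration of what happens under the hood (this is the generic manual workflow, not Bright’s actual tooling; the device name is a placeholder), the equivalent steps look like:

```shell
# Illustration only: commands are echoed rather than executed, since they
# are destructive and require a real block device.
run() { echo "+ $*"; }          # replace 'echo' with real execution to apply

DEV=/dev/sdb2                    # placeholder device

run cryptsetup luksFormat "$DEV"       # initialize LUKS encryption on the device
run cryptsetup open "$DEV" cryptdata   # unlock it as /dev/mapper/cryptdata
run mkfs.xfs /dev/mapper/cryptdata     # create a filesystem on the mapped device
run mount /dev/mapper/cryptdata /data  # mount; data on disk is encrypted at rest
```

Without the LUKS key, the raw contents of `$DEV` are unreadable, which is the property the centrally managed keys in Bright build on.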
To further enhance security, SELinux and FIPS mode can now be enabled or disabled easily on a per-node or per-node-category basis.
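Whichever way the settings are applied, the effective state on a given node can be verified with the standard RHEL-family tools (shown echo-style here, since they need a live node; run them directly on a cluster node):

```shell
# Illustration: commands are echoed rather than executed.
run() { echo "+ $*"; }

run getenforce                 # SELinux mode: Enforcing / Permissive / Disabled
run sestatus                   # detailed SELinux status
run fips-mode-setup --check    # reports whether FIPS mode is enabled
```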
GigaIO’s FabreX technology enables an entire server rack to be treated as a single compute resource. All resources, normally located inside the server (GPUs, FPGAs, storage), can now be pooled in accelerator or storage enclosures where they are available to all the servers in the rack. Accelerator and storage enclosures continue to communicate over native PCIe, as they would if they were still plugged into the server motherboard, for the lowest possible latency and highest bandwidth performance.
Bright 9.2 integrates with GigaIO FabreX so that nodes can be “composed” using cmsh or Bright View.
By default, the Bright cloud wizard creates new virtual network resources in the cloud (e.g., VPCs, VNETs, and security groups). This provides quick and easy cloud deployments. But some organizations may be required to use existing resources, such as those created by their network or security teams, or deployments that rely on AWS Direct Connect or Azure ExpressRoute.
Bright already supported such scenarios, but in 9.2 the cloud wizards have been redesigned to streamline such deployments.
Bright has traditionally made it very easy to deploy clusters in the cloud from a Bright-provided cloud image. However, many organizations are required to use cloud disk images (e.g., AMIs or VMs) approved by their global IT department, or specific OS distributions for which Bright does not publish an image. This feature allows organizations to start with an IT-approved cloud disk image and then add the Bright software to it, resulting in a Bright head node in the cloud that uses the approved image. Similarly, the Bright software image used for provisioning compute nodes is also built from the IT-approved cloud disk image.
This feature uses Bright’s Ansible integration, which has been extended in Bright 9.2 to include a new Ansible Bright head node installer role. This functionality will be available on all of the OS distributions supported by Bright and will be backported to Bright 9.1.
Many years of experience running computationally intensive workloads on cloud servers rather than on-premises servers have demonstrated that cloud servers are just as likely to fail, if not more so.
A new feature in Bright 9.2 allows organizations using Bright’s Cluster as a Service (CaaS) with AWS to eliminate the head node as a single point of failure by standing up High Availability (HA) head nodes in the cloud so that a cluster continues to run even if the primary head node fails. Support for Azure based HA head nodes will follow in the future.
Bright collects hundreds of metrics from cluster devices, jobs, and subsystems and allows administrators to visualize them in Bright View’s monitoring screens. In Bright 9.2, information that is commonly needed by administrators will be automatically collected and presented by default.
For example, when an administrator selects a node category, the overview tab for that category will show useful metrics that pertain to the group, such as CPU utilization, memory utilization, job slot utilization, health check information, etc.
Bright’s integration with Jupyter makes it possible to schedule kernels in Kubernetes or in an HPC workload management system such as Slurm through Jupyter Notebooks. In Jupyter, users can create kernel definitions by instantiating a kernel template, specifying information such as the number of GPUs to allocate when the kernel is launched, or the job queue to which the kernel should be submitted. The Bright-Jupyter integration also works seamlessly with Bright’s pre-certified and supported machine learning packages.
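Under the hood, a Jupyter kernel definition is a standard `kernel.json` spec. As a hypothetical sketch of what an instantiated template could resemble (the `srun` wrapping, queue name, and display name here are illustrative assumptions, not Bright’s actual generated output; `{connection_file}` is the standard Jupyter placeholder):

```json
{
  "display_name": "Python 3 (Slurm, 1 GPU)",
  "language": "python",
  "argv": [
    "srun", "--gres=gpu:1", "--partition=defq",
    "python", "-m", "ipykernel_launcher", "-f", "{connection_file}"
  ]
}
```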
Bright’s machine learning packages are tested and certified for NVIDIA DGX nodes. You have a choice of running Bright’s machine learning packages on bare metal, or running NVIDIA NGC container images either through Kubernetes or using Pyxis/enroot through Slurm. In previous Bright versions, using NGC containers with Jupyter required adding several packages and files to the NGC containers so that Jupyter could interact with them; the Jupyter kernel modules had to be installed, along with a Bright adapter (ipykernel-k8s.py). Bright 9.2 provides a Kubernetes operator that eliminates the need for the ipykernel-k8s.py adapter.
Because several NGC container images already contain the Jupyter kernel module, Bright’s Jupyter Kubernetes operator makes it possible to use NGC container images without any changes to the Jupyter setup or to the container images themselves. The only requirement is that the image contains the Jupyter kernel modules, as is often the case.
With v1.20, Kubernetes deprecated Docker as a container runtime (specifically, the dockershim adapter to its Container Runtime Interface, CRI). This feature allows organizations to use containerd as the Kubernetes container runtime instead. In addition, several enhancements have been made to configuring Pod Security Policies (PSPs) to restrict the operations that users may perform in their containers.
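Switching to containerd amounts to pointing the kubelet at containerd’s CRI socket, and a Pod Security Policy is a standard Kubernetes object. A minimal hand-written example (for illustration; this is not Bright’s generated configuration, and the policy name is a placeholder) might look like:

```yaml
# kubelet is pointed at containerd's CRI socket, e.g.:
#   --container-runtime-endpoint=unix:///run/containerd/containerd.sock
# A minimal restrictive PodSecurityPolicy (policy/v1beta1):
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted-example
spec:
  privileged: false                  # disallow privileged containers
  allowPrivilegeEscalation: false
  runAsUser:
    rule: MustRunAsNonRoot           # containers must not run as root
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  fsGroup:
    rule: RunAsAny
  volumes: ["configMap", "emptyDir", "secret", "persistentVolumeClaim"]
```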
Red Hat announced that it will discontinue CentOS 8 by the end of 2021 and will instead focus on CentOS Stream going forward. CentOS 7 will continue to be updated until 2024 and is therefore not affected by this change.
In Bright 9.2, we will provide Rocky Linux as a replacement for CentOS 8. Customers will be able to deploy RHEL 8-compatible clusters using Rocky Linux 8. In addition, we will provide a Knowledge Base article that explains the steps an administrator can take to convert an existing Bright 9.0/9.1 cluster running CentOS 8 to Rocky Linux 8.
In Bright 9.2, we have extended our Redfish integration to handle firmware management. As a result, organizations can use Redfish to deploy firmware upgrades for various components to cluster nodes with HPE Integrated Lights-Out (iLO) 5 controllers.
Bright 9.2 allows Multi-Instance GPU (MIG) configurations to be created for individual nodes or for entire categories of nodes. This allows MIG-compatible GPUs to be partitioned so that GPU instances can be allocated to different jobs running in parallel.
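For reference, the same partitioning can be performed manually on a single node with the standard `nvidia-smi mig` workflow (shown echo-style, since it requires a MIG-capable GPU such as an A100; profile IDs vary per GPU model):

```shell
# Illustration: commands are echoed rather than executed.
run() { echo "+ $*"; }

run nvidia-smi -i 0 -mig 1                 # enable MIG mode on GPU 0
run nvidia-smi mig -i 0 -cgi 19,19,19 -C   # create GPU instances (and compute instances with -C)
run nvidia-smi -L                          # list the resulting MIG devices
```

Doing this per node category in Bright avoids repeating these steps host by host.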
Bright 9.2 now allows unprivileged users to run containerized workloads through the Slurm workload management system by leveraging NVIDIA’s Pyxis and enroot components. It was already possible to deploy Pyxis and enroot by following a set of manual steps, but the installation procedure has now been integrated into the setup process of the Slurm workload management system.
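Once Pyxis is installed, `srun` gains a `--container-image` option, so an unprivileged user can launch a job directly from an NGC image (shown echo-style, since it needs a Slurm cluster with Pyxis/enroot; the image tag is an example):

```shell
# Illustration: commands are echoed rather than executed.
run() { echo "+ $*"; }

# Run an unprivileged containerized job straight from an NGC image
# ('#' separates the registry from the image path in Pyxis syntax):
run srun --container-image=nvcr.io#nvidia/pytorch:21.12-py3 \
    python -c "import torch; print(torch.__version__)"
```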
Bright’s BeeGFS integration has been extended to allow multiple BeeGFS instances to be deployed and managed on a single Bright cluster.