Achieving competitive advantage in the cloud with Bright Cluster Manager


As traditional mindsets toward high-performance computing (HPC) increasingly give way to cloud-based approaches, organizations are seizing the advantages of faster time to results and reduced capital outlay compared to on-premises deployments. Leveraging HPC in the cloud has important business applications for companies that want to get into HPC but have not yet decided to deploy, or cannot afford, their own HPC cluster. Cloud technology means that companies can now adapt and scale more easily to accelerate innovation, drive business agility, streamline operations, and potentially reduce costs.

This document describes how to deploy a Bright Cluster as a Service environment in AWS, how to configure marketplace licensing, and how to configure Bright’s Auto Scaler to automatically start and stop cloud nodes as needed.


Deploy a CaaS cluster in AWS

Log in to the Bright customer portal at http://customer.brightcomputing.com. You will automatically be redirected to a secure login form. Enter your username and password to authenticate.

If you do not yet have an account you can create one by clicking on "Create a new account".

Select "Cluster on Demand" (CoD) from the menu on the customer portal's home page.

If you do not yet have a product key and you want a commercially supported solution, it's best to get a free CoD product key using this link. A free CoD license is valid for one head node for one year, and it includes technical support, so it's a great option if you have work to do and need enterprise-grade support.

Since you’re logged into the customer portal, your personal information is automatically populated into the fields. All you need to do is to agree to the Bright Computing Terms and Conditions, and press the "Request" button.

Upon completion, you will receive a notification saying, "Request was sent successfully".

While you wait for the product key to arrive in your inbox, return to the Cluster on Demand page. Select your preferred cloud provider, the region you want this cluster to be created in, and your cloud provider credentials.

Once the email arrives, copy your free product key and paste it into the product key field.

Now you get to design your cluster. You’ll need to give the cluster a name, and select a Bright version and an OS distribution for the cluster. There are four workload management options: Slurm, PBS Professional, PBS Professional CE, and "don't configure". In this example, we’ll choose Slurm.

Moving down to the head node option, select the instance type you want to use for your head node. The default is m3.medium, which is an economical choice suitable for small clusters, but all of the instance types available in the selected region appear in the select list.

Next, select a head node disk type and size. "gp2" is the default, and provides a general-purpose SSD, but you can also choose io1, st1, or sc1. 50GB is the default size, but you can change it to whatever size you need, subject to the maximum size supported by your selected head node disk type. We’ll set the disk size to 120GB.

Then, enter the number of compute nodes you want the cluster to initially create. Four is the default. But bear in mind that you can easily create additional compute nodes after the cluster is deployed.

Then, select the instance type you want to use for the compute nodes. All of the instance types that are available in the selected region are in the select list. We’ll select g4dn.xlarge, which provides a single T4 GPU.

Next, specify how you’re going to authenticate to the cluster, once it's deployed. You have two choices: you can set a password or provide an SSH public key. We’ll provide a public SSH key as it’s more secure.
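If you don’t already have an SSH key pair, you can generate one on your workstation before submitting the form. The key type and file name below are only an example:

$ ssh-keygen -t ed25519 -f ~/.ssh/caas-demo        # creates ~/.ssh/caas-demo and ~/.ssh/caas-demo.pub
$ cat ~/.ssh/caas-demo.pub                         # paste the contents of the .pub file into the form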

And finally, accept the Bright Computing end user license agreement, and press the submit button.

 

In just a few minutes you’ll receive an email indicating that your cluster has been created. Also note that the email provides a link to a Bright Knowledge Base article describing how to delete a cluster that was created from the customer portal, using the cloud provider's console or portal.

You can now log in as root using the IP address shown in the email.

$ ssh root@54.72.69.74

Once you’re logged in, set a password for the root user so that you can log in using Bright View. This step is not required if you selected password authentication during cluster installation.
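A minimal example, run on the head node (the exact prompts depend on the OS distribution you selected):

[root@CaaS-Demo ~]# passwd
Changing password for user root.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.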

Log in to the cluster using Bright View, Bright’s HTML5 web administration portal.

https://54.72.69.74:8081/bright-view/

Your browser will display a certificate warning. This is expected, because the head node uses a self-signed SSL certificate. You can resolve it by installing an SSL certificate from a recognized certificate authority, but most sites don't, because they understand why the warning appears and that the connection is not insecure: the communications are still encrypted using industry-standard SSL.
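If you want to inspect the certificate yourself, a standard OpenSSL check against the example address used above looks like this (for a self-signed certificate, the subject and issuer are the same):

$ openssl s_client -connect 54.72.69.74:8081 </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer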

Press the Advanced button to continue.

And then you need to click the "proceed" link.

You are now presented with the Bright View login dialog box. Enter "root" in the username field and the password you set in the shell in the password field, then press "Login" to authenticate to Bright View.

The next step is to power on and provision the cloud compute nodes. Select “Grouping”, then “Categories” from the resource tree, as shown above.

Select the action button for the default node category, as shown above. Then select power -> On from the action menu, as shown below.

Then press “Confirm” to power on the compute nodes.

The compute nodes are now being provisioned.

After a few minutes, all of the compute nodes will have been provisioned and will be UP. But there’s no need to wait for the compute nodes to finish provisioning before continuing with the configuration.
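You can also check node status from the head node with cmsh; a quick sketch (the exact listing depends on your cluster):

[root@CaaS-Demo ~]# cmsh -c "device; status"
cnode001 ................ [   UP   ]
cnode002 ................ [   UP   ]
cnode003 ................ [   UP   ]
cnode004 ................ [   UP   ]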

 

Configure Bright to use marketplace pricing

Now that the compute nodes are being provisioned, let’s configure if and when Bright will use the paid marketplace images. Select Cloud -> AWS -> AWS Settings.

Then click on the action button for the AWS provider. You will be presented with the settings page for the AWS cloud provider, as shown below. Then click on the arrow to the right of “USE PAID MARKETPLACE AMIS” to expand the topic.

This is the setting that controls if, and when, Bright will use the paid marketplace AMI. The default is “NEVER”, meaning that, by default, Bright will only start cloud nodes up to the number provided by your license.

Bright provides two cloud disk images for each public cloud provider: one is the "normal" image that uses unused Bright licenses you already have, and the other is a paid marketplace image. The paid marketplace image is for customers who do not have sufficient unused Bright licenses.

When the paid marketplace image is used, the bill you receive from your cloud provider will include a small hourly charge for the Bright software license it includes. 

For example, let's assume you have 100 Bright licenses, and that you are using 90 of them for your on-prem nodes. That means that you have 10 unused Bright software licenses. You can then start 10 cloud nodes - one cloud director and nine cloud compute nodes - all using the normal cloud image. 

You can run those 10 cloud nodes 24/7/365 and you will never incur any additional licensing costs from Bright for those 10 nodes. The cloud provider will still bill you, but that bill will not include any charges from Bright Computing.

But if you need to start 20 cloud nodes, and you have selected “AS_NEEDED” in the “USE PAID MARKETPLACE AMIS” select list, Bright will start all 20 cloud nodes and the bill you receive from your cloud provider will include a small hourly charge for the Bright software licenses for the 10 nodes you are running over and above what you're licensed for. 

We will select “AS_NEEDED”, since it’s key to this strategy, and then we’ll press “SAVE” to save our changes.

 

Configure Bright's Auto Scaler

Configuring Auto Scaler can seem complicated to administrators who are not familiar with it, so we will break it down into four steps: 

  • Configure an Auto Scaler Configuration Overlay
  • Configure a Resource Provider
  • Configure a Workload Engine
  • Configure a Workload Tracker

Configure an Auto Scaler Configuration Overlay

Why use a configuration overlay?

Auto Scaler is assigned and configured as a Bright role. Roles can be assigned to individual nodes, or to groups of nodes using node categories or configuration overlays. Configuration overlays allow you to assign all head nodes to the overlay, so that if High Availability (HA) is configured, Auto Scaler is also automatically HA. The service is automatically configured to run on the ACTIVE head node.

 

As the root user, start the Cluster Manager Shell (CMSH).

[root@CaaS-Demo ~]# cmsh

 

Enter configuration overlay mode

[CaaS-Demo]% configurationoverlay

 

Create a new configuration overlay for the Auto Scaler service. 

[CaaS-Demo->configurationoverlay]% add auto-scaler

 

Configure the auto-scaler configuration overlay to run on all head nodes 

[CaaS-Demo->configurationoverlay*[auto-scaler*]]% set allheadnodes yes

 

This is what the auto-scaler configuration overlay should look like at this point.

[CaaS-Demo->configurationoverlay*[auto-scaler*]]% show
Parameter                        Value
-------------------------------- ------------------------------------------------
Name                             auto-scaler
Revision
All head nodes                   yes
Priority                         500
Nodes
Categories
Roles                            scaleserver
Customizations                   <0 in submode>

 

Configure a resource provider

Enter the ‘roles’ sub-mode and assign the ‘scaleserver’ role

[CaaS-Demo->configurationoverlay*[auto-scaler*]]% roles
[CaaS-Demo->configurationoverlay*[auto-scaler*]->roles]% assign scaleserver

 

Turn on debug logging. This is especially useful when configuring Auto Scaler because it causes the cm-scale daemon to write more verbose log files.

[CaaS-Demo->configurationoverlay*[auto-scaler*]->roles*[scaleserver*]]% set debug yes
[CaaS-Demo->configurationoverlay*[auto-scaler*]->roles*[scaleserver*]]% set runinterval 60

 

This is what the scaleserver role should look like at this point:


[CaaS-Demo->configurationoverlay*[auto-scaler*]->roles*[scaleserver*]]% show
Parameter                        Value
-------------------------------- ------------------------------------------------
Name                             scaleserver
Revision
Type                             ScaleServerRole
Add services                     yes
Provisioning associations        <0 internally used>
Engines                          <0 in submode>
Resource Providers               <0 in submode>
Dry Run                          no
Debug                            yes
Run Interval                     60
Advanced Settings                

 

Next, we’ll configure a resource provider. A resource provider is a pool of nodes that Auto Scaler can allocate to workload.

Enter ‘resourceproviders’ sub-mode, and then add a dynamic resource provider. A dynamic resource provider is typically used in the cloud, because it can create additional cloud compute nodes by cloning an existing cloud node.

[CaaS-Demo->configurationoverlay*[auto-scaler*]->roles*[scaleserver*]]% resourceproviders
[CaaS-Demo->configurationoverlay*[auto-scaler*]->roles*[scaleserver*]->resourceproviders]% add dynamic aws


The ‘templatenode’ is the hostname of the node Auto Scaler will clone if it determines that you need additional cloud compute nodes. We’ll set it to the name of our first compute node, cnode001.

[CaaS-Demo->configurationoverlay*[auto-scaler*]->roles*[scaleserver*]->resourceproviders*[aws*]]% set templatenode cnode001

 

The ‘noderange’ is the range of nodes that Auto Scaler will use and control. We only have four compute nodes, but because we anticipate that we might need as many as 10, we’ll set this to cnode002 through cnode010.

[CaaS-Demo->configurationoverlay*[auto-scaler*]->roles*[scaleserver*]->resourceproviders*[aws*]]% set noderange cnode002..cnode010

 

By default, Auto Scaler will try to use the tun0 interface. But since this is a CaaS cluster in AWS the cloud compute nodes have eth0 instead of tun0. So we’ll set ‘networkinterface’ to eth0.

[CaaS-Demo->configurationoverlay*[auto-scaler*]->roles*[scaleserver*]->resourceproviders*[aws*]]% set networkinterface eth0

 

This is what the AWS resource provider should look like now.

[CaaS-Demo->configurationoverlay*[auto-scaler*]->roles*[scaleserver*]->resourceproviders*[aws*]]% show
Parameter                        Value
-------------------------------- ------------------------------------------------
Name                             aws
Revision
Type                             dynamic
Enabled                          yes
Priority                         0
Whole Time                       0
Stopping Allowance Period        0
Keep Running
Extra Nodes
Extra Node Idle Time             3600
Extra Node Start                 yes
Extra Node Stop                  yes
Allocation Prolog
Allocation Epilog
Allocation Scripts Timeout       10
Template Node                    cnode001
Node Range                       cnode002..cnode010
Network Interface                eth0
Start Template Node              no
Stop Template Node               no
Remove Nodes                     no
Leave Failed Nodes               yes
Never Terminate                  32
Default Resources                cpus=1

 

Note that the “start template node” and “stop template node” parameters are set to “no”. This allows the administrator to manually control node cnode001, but it also means that Auto Scaler really only has nine nodes under its control. If you really want Auto Scaler to control 10 cloud nodes you could either set “start template node” and “stop template node” to “yes”, or you could set the node range to cnode002..cnode011.
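For example, either of the following would work while still in the aws resource provider sub-mode. This is only a sketch, and it assumes the cmsh parameter names follow the same pattern as the other parameters shown above (starttemplatenode, stoptemplatenode):

[CaaS-Demo->configurationoverlay*[auto-scaler*]->roles*[scaleserver*]->resourceproviders*[aws*]]% set starttemplatenode yes
[CaaS-Demo->configurationoverlay*[auto-scaler*]->roles*[scaleserver*]->resourceproviders*[aws*]]% set stoptemplatenode yes

or, using the documented noderange parameter:

[CaaS-Demo->configurationoverlay*[auto-scaler*]->roles*[scaleserver*]->resourceproviders*[aws*]]% set noderange cnode002..cnode011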

 

Configure a workload engine

Now let’s configure a workload engine. Type exit twice to get back to the top-level of the scaleserver role.

[CaaS-Demo->configurationoverlay*[auto-scaler*]->roles*[scaleserver*]->resourceproviders*[aws*]]% exit; exit

 

Then type ‘engines’ to enter the engines sub-mode.

[CaaS-Demo->configurationoverlay*[auto-scaler*]->roles*[scaleserver*]]% engines

 

Add an HPC type engine named “slurm”. An HPC type engine is generally used with HPC workload managers such as Slurm, PBS Pro, OpenPBS, UGE, and LSF. The name “slurm” is just a name, but as a best practice, name the engine after the workload manager it represents.

[CaaS-Demo->configurationoverlay*[auto-scaler*]->roles*[scaleserver*]->engines]% add hpc slurm

 

Set ‘WLM cluster’ to “slurm”, which is the name of the Slurm instance this engine represents.

[CaaS-Demo->configurationoverlay*[auto-scaler*]->roles*[scaleserver*]->engines*[slurm*]]% set wlmcluster slurm

 

This is what the slurm engine should look like now.

[CaaS-Demo->configurationoverlay*[auto-scaler*]->roles*[scaleserver*]->engines*[slurm*]]% show
Parameter                        Value
-------------------------------- ------------------------------------------------
Name                             slurm
Revision
Type                             hpc
Trackers                         <0 in submode>
Workloads Per Node               1
Priority                         0
Age Factor                       1.000000
Engine Factor                    1.000000
External Priority Factor         0.000000
WLM cluster                      slurm


 

Configure a workload tracker

A tracker is an Auto Scaler object that watches the workload in a specified workload manager queue or Kubernetes namespace. Since we added an HPC type engine, we can only add a queue type tracker. Let’s add it now.

From the engines sub-mode, type ‘trackers’ to enter the trackers sub-mode.

[CaaS-Demo->configurationoverlay*[auto-scaler*]->roles*[scaleserver*]->engines*[slurm*]]% trackers

 

Then add the tracker. We’ll add a queue type tracker named ‘defq’, which is the name of the Slurm partition (queue) we want Auto Scaler to watch. The name is only a name; we could have named it anything, but the convention is to name the tracker after the queue it will watch.

[CaaS-Demo->configurationoverlay*[auto-scaler*]->roles*[scaleserver*]->engines*[slurm*]->trackers]% add queue defq

 

The ‘queue’ parameter is what actually tells Auto Scaler that we want it to watch for pending jobs in the defq. So we’ll set queue to “defq”.

[CaaS-Demo->configurationoverlay*[auto-scaler*]->roles*[scaleserver*]->engines*[slurm*]->trackers*[defq*]]% set queue defq

 

We’ll set the ‘Primary Overlays’ parameter to “slurm-client” so that any nodes Auto Scaler creates will be added to the slurm-client configuration overlay, and will therefore run Slurm jobs.

[CaaS-Demo->configurationoverlay*[auto-scaler*]->roles*[scaleserver*]->engines*[slurm*]->trackers*[defq*]]% set primaryoverlays slurm-client

 

And we’ll set ‘Allowed resource providers’ to “aws”, the name of our only resource provider.

[CaaS-Demo->configurationoverlay*[auto-scaler*]->roles*[scaleserver*]->engines*[slurm*]->trackers*[defq*]]% set allowedresourceproviders aws

 

Now we’ll commit our changes. This will save our changes, and it will cause the Bright CMDaemon to start the cm-scale daemon on the active head node.

[CaaS-Demo->configurationoverlay*[auto-scaler*]->roles*[scaleserver*]->engines*[slurm*]->trackers*[defq*]]% configurationoverlay commit
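With the configuration committed, a quick way to exercise Auto Scaler is to submit more jobs to defq than the running compute nodes can handle, and then watch additional cloud nodes being started. A sketch, run as a user that is permitted to submit Slurm jobs (the jobs themselves are just placeholders):

$ for i in $(seq 1 20); do sbatch -p defq --wrap="sleep 600"; done
$ squeue        # jobs left pending in defq are what Auto Scaler reacts to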


Conclusion

In this example, Bright Computing has provided a proven method for companies on the outside of HPC looking in to affordably leverage cloud resources and achieve tangible competitive advantage. With this configuration, you will be able to use any number of modern, powerful compute servers when you need them, without purchasing any of them. You only have the operational expense of running one head node in the cloud, and you don’t need to buy any Bright subscription licenses either. The head node license is free, and the small hourly charge for the Bright Cluster Manager licenses is included in the cloud provider’s bill; even then, you only pay for what you actually use.

Note: This strategy may not work if your workload is constant. In that case, it would be cheaper to purchase the Bright subscription licenses up front. But for sites that have intermittent workload, even if it’s a lot of workload, this is the best strategy we’ve seen.

If you have any questions about how to set up a similar cloud-based solution using Bright Cluster Manager, or if you would like a demo, please fill out the form below.
