How To Configure Bright Auto-Scaler to Achieve True Convergence of HPC and AI


Auto-scaling HPC and Kubernetes

Bright CEO Bill Wagner’s recent blog post “The Convergence of HPC and A.I.” pointed out that creating HPC and Kubernetes silos within a shared HPC infrastructure is more like “coexistence than convergence”. The solution I will be highlighting is Bright Auto-scaler, which automatically resizes HPC and Kubernetes clusters (workload engines) according to workload demand and configured policies. This post walks through a scenario that shows how to configure Bright Auto-scaler to achieve true convergence of HPC and A.I.

Scenario description

While our ultimate goal is to resize, before we can configure Bright Auto-scaler we need to think about what we are trying to accomplish. For this scenario, let’s assume that Kubernetes jobs are a higher priority than HPC jobs. That means that when there is contention (pending HPC jobs and pending Kubernetes jobs at the same time), Bright Auto-scaler should allocate available nodes to Kubernetes first. If there are no pending Kubernetes jobs but there are pending HPC jobs, then we want Auto-scaler to allocate the available nodes to the HPC scheduler.

Add a resource provider

Bright Auto-scaler allocates nodes from resource providers, which can be local nodes (static type) or cloud nodes (dynamic type). In this scenario we will create a static resource provider, arbitrarily named “pool”, and assign nodes node005 and node006 to it. The priority is unimportant in this scenario, but it becomes very useful when there is more than one resource provider, because Auto-scaler allocates nodes from the highest-priority resource provider first.

[rms-aws-demo->device[rms-aws-demo]->roles[scaleserver]]% resourceproviders
[rms-aws-demo->device[rms-aws-demo]->roles[scaleserver]->resourceproviders]% add static pool
[rms-aws-demo->device*[rms-aws-demo*]->roles*[scaleserver*]->resourceproviders*[pool*]]% set nodes node005..node006
[rms-aws-demo->device*[rms-aws-demo*]->roles*[scaleserver*]->resourceproviders*[pool*]]% set priority 100
[rms-aws-demo->device*[rms-aws-demo*]->roles*[scaleserver*]->resourceproviders*[pool*]]% show
Parameter                        Value
-------------------------------- -------------------------------------------
Revision
Type                             static
Nodes                            node005..node006
Name                             pool
Enabled                          yes
Nodegroups
Priority                         100
Keep Running
Whole Time                       0
Stopping Allowance Period        0
Extra Nodes
Extra Node Idle Time             3600
Extra Node Start                 yes
Extra Node Stop                  yes
Default Resources                cpus=1
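One point worth noting: the asterisks in the cmsh prompts indicate uncommitted changes, so the new resource provider (and, likewise, the engine and tracker added in the next sections) only takes effect once it has been committed, for example:

[rms-aws-demo->device*[rms-aws-demo*]->roles*[scaleserver*]->resourceproviders*[pool*]]% commit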

 

Add a Kubernetes workload engine

Next we add a workload engine of type kubernetes. Since we want Auto-scaler to allocate resources to the Kubernetes workload engine first, we set its priority higher than that of the existing HPC engine. We also set the cluster parameter to the name of the existing Kubernetes cluster.

[rms-aws-demo->device[rms-aws-demo]->roles[scaleserver]->engines]% add kubernetes k8s
[rms-aws-demo->device*[rms-aws-demo*]->roles*[scaleserver*]->engines*[k8s*]]% set priority 100
[rms-aws-demo->device*[rms-aws-demo*]->roles*[scaleserver*]->engines*[k8s*]]% set cluster default
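If you are not sure what priority the existing HPC engine was given, you can inspect it from the same engines submode before choosing a value. A minimal sketch (the engine name “slurm” is only an example here; use whatever name the list command shows on your cluster):

[rms-aws-demo->device[rms-aws-demo]->roles[scaleserver]->engines]% list
[rms-aws-demo->device[rms-aws-demo]->roles[scaleserver]->engines]% show slurm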

Add a Kubernetes namespace tracker

Bright Auto-scaler trackers retrieve information from HPC queues and Kubernetes namespaces. Since this is a Kubernetes engine, we add a tracker of type namespace. By convention the name of the tracker is the same as the name of the Kubernetes namespace it tracks, but that is not a requirement.

[rms-aws-demo->device[rms-aws-demo]->roles[scaleserver]->engines[k8s]]% trackers
[rms-aws-demo->device[rms-aws-demo]->roles[scaleserver]->engines[k8s]->trackers]% add namespace default
[rms-aws-demo->device*[rms-aws-demo*]->roles*[scaleserver*]->engines*[k8s*]->trackers*[default*]]% show
Parameter                        Value
-------------------------------- --------------------------------------------
Revision
Type                             namespace
Name                             default
Controller Namespace
Object                           job
Enabled                          yes
Assign Category
Primary Overlays
Queue Length Threshold           0
Age Threshold                    0

Next we set the tracker’s controllernamespace parameter to the name of the Kubernetes namespace we want it to track, which is the default namespace in this case. And finally, we set the primaryoverlays parameter to a list of configuration overlays. In this case it is a list of one, kube-default-worker. 

When Auto-scaler allocates a node to this Kubernetes namespace, it removes any overlays it previously added, and then adds the kube-default-worker overlay. 

[rms-aws-demo->device*[rms-aws-demo*]->roles*[scaleserver*]->engines*[k8s*]->trackers*[default*]]% set controllernamespace default
[rms-aws-demo->device*[rms-aws-demo*]->roles*[scaleserver*]->engines*[k8s*]->trackers*[default*]]% set primaryoverlays kube-default-worker
[rms-aws-demo->device*[rms-aws-demo*]->roles*[scaleserver*]->engines*[k8s*]->trackers*[default*]]% show
Parameter                        Value
-------------------------------- -------------------------------------------
Revision
Type                             namespace
Name                             default
Controller Namespace             default
Object                           job
Enabled                          yes
Assign Category
Primary Overlays                 kube-default-worker
Queue Length Threshold           0
Age Threshold                    0

Add configuration overlays for the HPC queues

Auto-scaler can operate using node categories or configuration overlays, but you cannot mix them. Since auto-scaling Kubernetes requires configuration overlays, we must also assign the workload manager roles using configuration overlays. Here are the three overlays that will be used.

[rms-aws-demo->configurationoverlay]% list
name (key)           nodes                     roles
-------------------- ------------------------- ------------------------------
. . .
kube-default-worker                            Docker::Host, Kubernetes::Api+
slurm-jaguar         node001,node002           slurmclient
slurm-shark          node003,node004           slurmclient

The kube-default-worker overlay was created by the Bright Kubernetes deployment wizard. You will need to create the slurm-jaguar and slurm-shark overlays; their names are arbitrary. When Auto-scaler allocates a node to a workload engine, it assigns the node to one of these configuration overlays. For example, when Auto-scaler allocates a node to the Slurm workload engine for a job pending in the Slurm shark queue, the allocated node(s) are added to the slurm-shark configuration overlay and begin running the Slurm daemon serving the shark queue.

Add a configuration overlay for slurm-jaguar as shown below, then create the slurm-shark overlay in the same way (a sketch of the equivalent slurm-shark commands follows the example). In each case, set the nodes parameter to the list of nodes that should always be assigned to the queue.

[rms-aws-demo->configurationoverlay]% add slurm-jaguar
[rms-aws-demo->configurationoverlay*[slurm-jaguar*]]% set nodes node001,node002
[rms-aws-demo->configurationoverlay*[slurm-jaguar*]]% roles
[rms-aws-demo->configurationoverlay*[slurm-jaguar*]->roles]% assign slurmclient
[rms-aws-demo->configurationoverlay*[slurm-jaguar*]->roles*[slurmclient*]]% set queues jaguar
[rms-aws-demo->configurationoverlay*[slurm-jaguar*]->roles*[slurmclient*]]% commit
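For completeness, here is a sketch of the equivalent commands for the slurm-shark overlay, using the node list and queue name from the overlay listing shown earlier:

[rms-aws-demo->configurationoverlay]% add slurm-shark
[rms-aws-demo->configurationoverlay*[slurm-shark*]]% set nodes node003,node004
[rms-aws-demo->configurationoverlay*[slurm-shark*]]% roles
[rms-aws-demo->configurationoverlay*[slurm-shark*]->roles]% assign slurmclient
[rms-aws-demo->configurationoverlay*[slurm-shark*]->roles*[slurmclient*]]% set queues shark
[rms-aws-demo->configurationoverlay*[slurm-shark*]->roles*[slurmclient*]]% commit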

 

Your finished overlays should look like this.

[rms-aws-demo->configurationoverlay]% show slurm-jaguar
Parameter                        Value
-------------------------------- -----------------------------------------
Categories
Customizations                   <0 in submode>
Name                             slurm-jaguar
Nodes                            node001,node002
Priority                         500
Revision
Roles                            slurmclient

[rms-aws-demo->configurationoverlay]% show slurm-shark
Parameter                        Value
-------------------------------- -----------------------------------------
Categories
Customizations                   <0 in submode>
Name                             slurm-shark
Nodes                            node003,node004
Priority                         500
Revision
Roles                            slurmclient

 

Demonstration

Initial Conditions

The head node and four local compute nodes are UP. Nodes node005 and node006 are powered off.

[rms-aws-demo->device]% status -t physicalnode
node001 .................. [   UP   ] health check failed
node002 .................. [   UP   ] health check failed
node003 .................. [   UP   ] health check failed
node004 .................. [   UP   ] health check failed
node005 .................. [  DOWN  ]
node006 .................. [  DOWN  ]

 

Nodes node001 and node002 are serving the Slurm jaguar partition, and nodes node003 and node004 are serving the shark partition. There are no Kubernetes worker nodes.

[rms-aws-demo->configurationoverlay]% list
name (key)           nodes                     roles
-------------------- ------------------------- -----------------------------
. . .
kube-default-worker                            Docker::Host, Kubernetes::Api+
slurm-jaguar         node001,node002           slurmclient
slurm-shark          node003,node004           slurmclient

[root@rms-aws-demo ~]# sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
. . .
shark        up   infinite      2  drain node[003-004]
jaguar       up   infinite      2  drain node[001-002]

 

Submit Kubernetes workload

A user submits several Slurm batch jobs and Kubernetes jobs. Some of the batch jobs start right away (see the squeue output below), since nodes node003 and node004 are serving the shark queue.
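The exact submission commands are not shown in this demonstration. As a rough illustration only (the tf.sl script name comes from the squeue output below, while tf-job.yaml is a hypothetical Kubernetes Job manifest that uses metadata.generateName so each create produces a new Job), the workload could be submitted along these lines:

# Queue four copies of a Slurm batch script on the shark partition
[bob@rms-aws-demo ~]$ for i in 1 2 3 4; do sbatch -p shark tf.sl; done

# Create several Kubernetes Jobs in the default namespace
[bob@rms-aws-demo ~]$ for i in 1 2 3 4; do kubectl create -f tf-job.yaml; done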

[root@rms-aws-demo ~]# squeue
JOBID   PARTITION   NAME       USER ST      TIME      NODES NODELIST(REASON)
369     shark       tf.sl      bob PD       0:00      1 (Resources)
370     shark       tf.sl      bob PD       0:00      1 (Priority)
367     shark       tf.sl      bob  R       8:43      1 node003
368     shark       tf.sl      bob  R       8:43      1 node004

 

Bright Auto-scaler sees the pending HPC and Kubernetes jobs, but since the Kubernetes engine has the higher priority, Auto-scaler maps two of the Kubernetes jobs to the two available pool nodes.

2019-09-12 02:31:09,629 [DEBUG] ----------------  MAPPING  ---------------
2019-09-12 02:31:09,629 [DEBUG] [M] node005: k8s/Job:97169d13-d4f4-11e9-8313-fa163ee666ac
2019-09-12 02:31:09,629 [DEBUG] [M] node006: k8s/Job:971443d0-d4f4-11e9-8313-fa163ee666ac

 

Auto-scaler adds the two mapped nodes to the kube-default-worker configuration overlay and powers them on.

2019-09-12 02:31:09,629 [DEBUG] ----------------  ACTIONS  ---------------
2019-09-12 02:31:09,629 [INFO] Number of executing operations: 3
2019-09-12 02:31:09,629 [DEBUG] [A] add node005 to overlay kube-default-worker (workload)
2019-09-12 02:31:09,629 [DEBUG] [A] add node006 to overlay kube-default-worker (workload)
2019-09-12 02:31:09,629 [DEBUG] [A] power on node005 (workload)
2019-09-12 02:31:09,629 [DEBUG] [A] power on node006 (workload)
2019-09-12 02:31:09,765 [INFO] Node node005 operation (power on) result: done
2019-09-12 02:31:09,765 [INFO] Node node006 operation (power on) result: done

 

In short order we see the nodes are UP.

Thu Sep 12 02:33:31 2019 [notice] rms-aws-demo: node005 [   UP   ]
Thu Sep 12 02:33:31 2019 [notice] rms-aws-demo: node006 [   UP   ]

 

And they have been added to the Kubernetes cluster.

[root@rms-aws-demo ~]# kubectl get nodes
NAME      STATUS   ROLES    AGE     VERSION
node005   Ready    <none>   5m54s   v1.12.10
node006   Ready    <none>   5m54s   v1.12.10

 

Some of the Kubernetes jobs are now running.

[bob@rms-aws-demo ~]$ kubectl get pods
NAME               READY   STATUS              RESTARTS   AGE
tf7cihnlnw-zrcmm   0/1     ContainerCreating   0          3m42s
tfjc40tcs9-b4xfd   0/1     ContainerCreating   0          3m42s
tfmknfr4t1-j4z76   1/1     Running             0          3m42s
tfyedc7qnd-p5jq6   1/1     Running             0          3m42s

A few minutes later, the same pods show that all of the Kubernetes jobs have completed.

[root@rms-aws-demo ~]# kubectl get pods
NAME               READY   STATUS      RESTARTS   AGE
tf7cihnlnw-zrcmm   0/1     Completed   0          9m42s
tfjc40tcs9-b4xfd   0/1     Completed   0          9m42s
tfmknfr4t1-j4z76   0/1     Completed   0          9m42s
tfyedc7qnd-p5jq6   0/1     Completed   0          9m42s

 

The Kubernetes jobs are now done, the previously pending batch jobs have started on the shark nodes, and there are no more pending HPC jobs.

[bob@rms-aws-demo ~]$ squeue
JOBID   PARTITION     NAME      USER ST      TIME      NODES NODELIST(REASON)
369     shark        tf.sl      bob  R       0:22      1 node004
370     shark        tf.sl      bob  R       0:22      1 node003

 

With nothing left for them to do, Auto-scaler powers off node005 and node006, returning them to the pool. It is also possible to configure Auto-scaler to leave idle nodes running; see the note after the log output below.

2019-09-12 02:45:37,957 [DEBUG] ----------------  ACTIONS  ---------------
2019-09-12 02:45:37,957 [INFO] Number of executing operations: 2
2019-09-12 02:45:37,957 [DEBUG] [A] power off node005 (unused)
2019-09-12 02:45:37,957 [DEBUG] [A] power off node006 (unused)
2019-09-12 02:45:37,957 [DEBUG] Performing operations ...
2019-09-12 02:45:37,983 [INFO] Node node005 operation (power off) result: done
2019-09-12 02:45:37,983 [INFO] Node node006 operation (power off) result: done
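One way to keep idle nodes running, assuming the Keep Running parameter visible in the resource provider’s show output earlier behaves as its name suggests (this is an assumption on my part; check the Bright administrator manual for the exact semantics), is to list the nodes in that parameter:

[rms-aws-demo->device[rms-aws-demo]->roles[scaleserver]->resourceproviders]% use pool
# "keeprunning" is assumed here to be the cmsh name of the "Keep Running" parameter
[rms-aws-demo->device[rms-aws-demo]->roles[scaleserver]->resourceproviders[pool]]% set keeprunning node005..node006
[rms-aws-demo->device*[rms-aws-demo*]->roles*[scaleserver*]->resourceproviders*[pool*]]% commit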