How to submit SLURM jobs to cloud nodes using cmsub with Bright

Stresscpu is a simple, CPU-intensive test job that you can use to submit test jobs to your cluster. The distribution includes everything needed to run the job, including the submission script and the executable. By default, it submits two test jobs to the SLURM partition specified on the command line. This article describes how to use it to submit jobs that run on cloud-based nodes.

Let's get started.


Download the distribution.

$ wget http://dl.dropbox.com/u/2999184/stresscpu.tar.gz
--2012-08-07 17:39:11--  http://dl.dropbox.com/u/2999184/stresscpu.tar.gz
Resolving dl.dropbox.com... 107.22.172.16, 107.22.253.68, 184.73.185.158, ...
Connecting to dl.dropbox.com|107.22.172.16|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9384 (9.2K) [application/x-tar]
Saving to: stresscpu.tar.gz

100%[=================================================================>] 9,384       --.-K/s   in 0.09s

2012-08-07 17:39:11 (100 KB/s) - stresscpu.tar.gz

Decompress and unarchive the distribution.

$ tar xvzf stresscpu.tar.gz
stresscpu/
stresscpu/stresscpu.sh
stresscpu/spool/
stresscpu/script/
stresscpu/stresscpu.submit
stresscpu/bin/
stresscpu/bin/stresscpu
stresscpu/results/

Before we start, let's verify that no jobs are running. We can see this from the command line, or using the Bright CMGUI.

$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)

[CMGUI screenshot: no jobs running]


Let's submit some jobs to the cloud partition. By default, the stresscpu.submit script supports submitting jobs to partitions named "defq" and "cloud". We'll submit to the cloud partition, which contains only nodes running in the cloud.
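To give a feel for what a wrapper like stresscpu.submit does, here is a hedged sketch: it writes out a job script like the ones shown below and hands it to cmsub. This is an illustration, not the shipped script; in particular, the assumptions that cmsub accepts a job-script path the way sbatch does, and that the partition can be chosen with a standard #SBATCH directive, should be checked against the Bright cmsub documentation.

```shell
#!/bin/sh
# Hedged sketch of a submit wrapper in the spirit of stresscpu.submit.
# Assumptions (not taken from the distribution): cmsub accepts a
# job-script path like sbatch does, and the partition can be selected
# with a standard #SBATCH directive inside the script.
PARTITION=${1:-cloud}

JOB=job1.sh
cat > "$JOB" <<EOF
#!/bin/sh
#SBATCH --partition=$PARTITION
echo "This is job 1 of 2"
echo "Running stresscpu for 1 minutes..."
srun \$HOME/stresscpu/stresscpu.sh 1
EOF
chmod +x "$JOB"

# Hand the script to cmsub, which uploads the input files to the cloud
# director and downloads the output when the job finishes.
if command -v cmsub >/dev/null 2>&1; then
    cmsub "$JOB"
fi
```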

$ stresscpu/stresscpu.submit cloud

Submitting stresscpu job 1:

#!/bin/sh
echo "This is job 1 of 2"
echo "Running stresscpu for 1 minutes..."
srun /home/rstober/stresscpu/stresscpu.sh 1

  Upload job id: 322
    User job id: 323
Download job id: 324


Submitting stresscpu job 2:

#!/bin/sh
echo "This is job 2 of 2"
echo "Running stresscpu for 4 minutes..."
srun /home/rstober/stresscpu/stresscpu.sh 4

  Upload job id: 325
    User job id: 326
Download job id: 327

The jobs were successfully submitted and are now running on cloud nodes catom02 and catom03. This can be seen more clearly in the Bright CMGUI.

The stresscpu.submit script submits the jobs to SLURM using the Bright cmsub program, which takes care of uploading the required input files and downloading the desired output files. Please see the cmsub article for details.

$  squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
    323     cloud stresscp  rstober  PD       0:00      1 (Dependency)
    326     cloud stresscp  rstober  PD       0:00      1 (Dependency)
    324 cloudtran stresscp  rstober  PD       0:00      1 (Dependency)
    327 cloudtran stresscp  rstober  PD       0:00      1 (Dependency)
    322 cloudtran stresscp  rstober   R       0:03      1 atom-head1
    325 cloudtran stresscp  rstober   R       0:03      1 atom-head1

[CMGUI screenshot: cloud jobs submitted]
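Notice the pattern in the squeue output: each cmsub submission produces three chained jobs — an upload job and a download job in the "cloudtran" partition, and the user job in "cloud" pending on its upload dependency. A quick way to see the per-partition breakdown is to run the output through awk; here the captured squeue text from above is parsed, so the snippet works without a live cluster:

```shell
#!/bin/sh
# Count jobs per partition from captured squeue output. The "cloudtran"
# partition carries the upload/download jobs; the user jobs sit in
# "cloud", pending on their upload dependency.
squeue_output='  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
    323     cloud stresscp  rstober  PD       0:00      1 (Dependency)
    326     cloud stresscp  rstober  PD       0:00      1 (Dependency)
    324 cloudtran stresscp  rstober  PD       0:00      1 (Dependency)
    327 cloudtran stresscp  rstober  PD       0:00      1 (Dependency)
    322 cloudtran stresscp  rstober   R       0:03      1 atom-head1
    325 cloudtran stresscp  rstober   R       0:03      1 atom-head1'

# Skip the header line, then tally the PARTITION column (field 2).
printf '%s\n' "$squeue_output" | awk 'NR > 1 { count[$2]++ }
    END { for (p in count) printf "%s: %d job(s)\n", p, count[p] }'
```

Against a live cluster, `squeue -h` (which suppresses the header) piped into the same awk tally gives the identical summary.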

The cmsub program manages uploading the required input files to the cloud director, automatically mirroring the local directory structure as shown below. The files stresscpu.sh and stresscpu were uploaded to the execution directory (/home/rstober/stresscpu) on the cloud director by the upload job.


The cloud director shares the user home directories with the cloud nodes in the Amazon EC2 region in which it is running. Each job's output files (slurm-323.out and slurm-326.out) are also written to the execution directory.

$ find stresscpu
stresscpu
stresscpu/spool
stresscpu/spool/323
stresscpu/spool/323/cpu0
stresscpu/spool/323/stresscpu.sh.status
stresscpu/spool/326
stresscpu/spool/326/cpu0
stresscpu/slurm-326.out
stresscpu/slurm-323.out
stresscpu/bin
stresscpu/bin/stresscpu
stresscpu/stresscpu.sh

Once the jobs have completed, the output files are downloaded to the submission directory.

$ ls -l stresscpu
total 32
drwxrwxr-x 2 rstober rstober 4096 Nov 13  2011 bin
drwxrwxr-x 2 rstober rstober 4096 Aug  7 17:39 results
drwxrwxr-x 2 rstober rstober 4096 Aug  7 19:33 script
-rw-rw-r-- 1 rstober rstober    0 Aug  7 19:33 slurm-322.out
-rw-rw-r-- 1 rstober rstober  313 Aug  7 19:36 slurm-323.out
-rw-rw-r-- 1 rstober rstober    0 Aug  7 19:36 slurm-324.out
-rw-rw-r-- 1 rstober rstober    0 Aug  7 19:33 slurm-325.out
-rw-rw-r-- 1 rstober rstober  210 Aug  7 19:38 slurm-326.out
-rw-rw-r-- 1 rstober rstober    0 Aug  7 19:38 slurm-327.out
drwxrwxr-x 4 rstober rstober 4096 Aug  7 17:39 spool
-rwxr-xr-x 1 rstober rstober 1577 Aug  7 17:19 stresscpu.sh
-rwxr-xr-x 1 rstober rstober 2133 Aug  7 19:33 stresscpu.submit
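If a follow-up step needs to block until cmsub's download job has delivered a result, a small polling helper does the trick. This is a hypothetical helper (wait_for_output is not part of Bright or the stresscpu distribution); it only relies on SLURM's default slurm-&lt;jobid&gt;.out naming, which the listing above shows:

```shell
#!/bin/sh
# Hypothetical helper: poll until a job's output file has been
# downloaded to the submission directory, then print it. The filename
# pattern slurm-<jobid>.out is SLURM's default output naming.
wait_for_output() {
    jobid=$1
    dir=$2
    timeout=${3:-600}                 # seconds to wait before giving up
    f="$dir/slurm-$jobid.out"
    while [ "$timeout" -gt 0 ]; do
        # A non-empty file means the download job has delivered results.
        if [ -s "$f" ]; then
            cat "$f"
            return 0
        fi
        sleep 5
        timeout=$((timeout - 5))
    done
    echo "timed out waiting for $f" >&2
    return 1
}
```

For example, `wait_for_output 326 "$HOME/stresscpu"` would print slurm-326.out as soon as it arrives.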

The job output shows that it completed successfully.

$ cat stresscpu/slurm-326.out
This is job 2 of 2
Running stresscpu for 4 minutes...
stresscpu.sh: starting
stresscpu.sh: found 1 processors
Still 4 minute(s) to go.
Still 3 minute(s) to go.
Still 2 minute(s) to go.
Still 1 minute(s) to go.
...
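The countdown behavior visible in that output can be sketched in a few lines of shell. This is an illustration in the style of stresscpu.sh, not the script shipped in the distribution; the tick length is made overridable so the sketch can be exercised quickly:

```shell
#!/bin/sh
# Illustrative countdown matching the style of stresscpu.sh's output
# above; a sketch, not the distributed script.
countdown() {
    left=$1           # run time in minutes
    tick=${2:-60}     # seconds per tick (overridable, e.g. for testing)
    echo "stresscpu.sh: starting"
    while [ "$left" -gt 0 ]; do
        echo "Still $left minute(s) to go."
        sleep "$tick"
        left=$((left - 1))
    done
}
```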
