Slurm 101: Basic Slurm Usage for Linux Clusters


This article describes basic Slurm usage for Linux clusters. Brief "how-to" topics include, in this order:

  • A simple Slurm job script
  • Submit the job
  • List jobs
  • Get job details
  • Suspend a job (root only)
  • Resume a job (root only)
  • Kill a job
  • Hold a job
  • Release a job
  • List partitions
  • Submit a job that's dependent on a prerequisite job being completed
 
OK. Let's get started.

Here's a simple Slurm job script:


$ cat slurm-job.sh
#!/usr/bin/env bash

#SBATCH -o slurm.sh.out
#SBATCH -p defq

echo "In the directory: `pwd`"
echo "As the user: `whoami`"
echo "write this is a file" > analysis.output
sleep 60
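
If your job needs specific resources, additional #SBATCH directives can be added to the script in the same way. The values below are purely illustrative:

#SBATCH -N 1           # number of nodes
#SBATCH -n 2           # number of tasks
#SBATCH -t 00:10:00    # time limit (hh:mm:ss)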

Submit the job:

$ module load slurm
$ sbatch slurm-job.sh
Submitted batch job 106

List jobs:

$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
    106      defq slurm-jo  rstober   R       0:04      1 atom01
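
On a busy cluster you may want to narrow the listing. squeue accepts, among others, a user filter and a job ID filter; for example, using the user and job ID from above:

$ squeue -u rstober
$ squeue -j 106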

Get job details:

$ scontrol show job 106
JobId=106 Name=slurm-job.sh
   UserId=rstober(1001) GroupId=rstober(1001)
   Priority=4294901717 Account=(null) QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=00:00:07 TimeLimit=UNLIMITED TimeMin=N/A
   SubmitTime=2013-01-26T12:55:02 EligibleTime=2013-01-26T12:55:02
   StartTime=2013-01-26T12:55:02 EndTime=Unknown
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=defq AllocNode:Sid=atom-head1:3526
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=atom01
   BatchHost=atom01
   NumNodes=1 NumCPUs=2 CPUs/Task=1 ReqS:C:T=*:*:*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/rstober/slurm/local/slurm-job.sh
   WorkDir=/home/rstober/slurm/local
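
The output is verbose. If you only need a few fields, piping it through grep is a quick way to pick them out, for example:

$ scontrol show job 106 | grep -E 'JobState|RunTime'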

Suspend a job (root only):

# scontrol suspend 135
# squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
    135      defq simple.s  rstober   S       0:10      1 atom01

Resume a job (root only):

# scontrol resume 135
# squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
    135      defq simple.s  rstober   R       0:13      1 atom01

Kill a job (users can cancel their own jobs; root can cancel any job):

$ scancel 135
$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
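
scancel can also target jobs in bulk, for example all of your own jobs, or all of your jobs with a particular name (the job name here is illustrative):

$ scancel -u rstober
$ scancel -n slurm-job.sh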

Hold a job:

$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
    139      defq   simple  rstober  PD       0:00      1 (Dependency)
    138      defq   simple  rstober   R       0:16      1 atom01
$ scontrol hold 139
$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
    139      defq   simple  rstober  PD       0:00      1 (JobHeldUser)
    138      defq   simple  rstober   R       0:32      1 atom01
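
A held job stays in the pending (PD) state until it is released. To list only pending jobs, squeue can filter by state:

$ squeue -t PD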

Release a job:

$ scontrol release 139
$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
    139      defq   simple  rstober  PD       0:00      1 (Dependency)
    138      defq   simple  rstober   R       0:46      1 atom01

List partitions:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite      1  down* atom04
defq*        up   infinite      3   idle atom[01-03]
cloud        up   infinite      2  down* cnode1,cnodegpu1
cloudtran    up   infinite      1   idle atom-head1
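
For more detail on a particular partition, scontrol can display its full configuration, and sinfo -N gives a node-oriented view of the same information:

$ scontrol show partition defq
$ sinfo -N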

Submit a job that's dependent on a prerequisite job being completed:

Here's a simple job script. Note that the Slurm -J option is used to give the job a name.

#!/usr/bin/env bash

#SBATCH -p defq
#SBATCH -J simple

sleep 60

Submit the job:

$ sbatch simple.sh
Submitted batch job 149

Now we'll submit another job that's dependent on the previous job. There are many ways to specify the dependency conditions, but singleton is the simplest. The -d singleton argument tells Slurm not to dispatch this job until all previously submitted jobs with the same name have completed.

$ sbatch -d singleton simple.sh
Submitted batch job 150
$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
    150      defq   simple  rstober  PD       0:00      1 (Dependency)
    149      defq   simple  rstober   R       0:17      1 atom01

Once the prerequisite job finishes, the dependent job is dispatched.

$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
    150      defq   simple  rstober   R       0:31      1 atom01
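
Besides singleton, Slurm supports other dependency types. For example, afterok dispatches a job only after a specific job has completed successfully; using the job ID from the example above:

$ sbatch -d afterok:150 simple.sh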