Introduction to SLURM
Most DCC partitions are lab-owned machines, which can only be used by members of that group. Submitting to a group partition gives the job "high priority".
Submit to partitions with
sbatch -p (partition name) --account=(account name)
(in a job script) or
srun -p (partition name) --account=(account name) --pty bash -i
(interactively). In general, the partition name and account name are the same for most lab-owned machines.
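For example, a job script for a hypothetical lab whose partition and account are both named "somelab" (placeholder names) would start with:
#!/bin/bash
#SBATCH -p somelab            # placeholder lab partition name
#SBATCH --account=somelab     # placeholder account name
hostname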
Common DCC Partitions
There are four different DCC partitions to which batch jobs and interactive sessions can be directed by all DCC users:
- common for jobs that will run on the DCC core nodes (up to 64 GB RAM).
- gpu-common for jobs that will run on DCC GPU nodes.
- scavenger for jobs that will run on lab-owned nodes in “low priority” (kill and requeue preemption).
- scavenger-gpu for GPU jobs that will run on lab-owned nodes in “low priority” (kill and requeue preemption).
Note: These partitions do not require an "--account=" flag (i.e., they accept jobs from any account). If a partition is not specified, the default partition is the common partition.
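These shared partitions can be used without an account flag; for example (myjob.sh is a placeholder script):
sbatch myjob.sh               # goes to the default common partition
sbatch -p scavenger myjob.sh  # low priority on lab-owned nodes, may be preempted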
Running an interactive job
Reserve a compute node by typing srun --pty bash -i
tm103@dcc-login-02 ~ $ srun --pty bash -i
srun: job 186535 queued and waiting for resources
srun: job 186535 has been allocated resources
tm103@dcc-core-11 ~ $
tm103@dcc-core-11 ~ $ squeue -u tm103
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
186535 common bash tm103 R 0:14 1 dcc-core-11
I now have an interactive session in the common partition on node dcc-core-11
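Resource options can be added to srun just as with sbatch; a sketch of an interactive session requesting 4 cores and 8 GB of RAM:
srun -c 4 --mem=8G --pty bash -i
Type exit to end the session and release the node.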
SLURM commands
sbatch
– Submit a batch job
#SBATCH
– Specify job parameters in a job script
squeue
– Show lists of jobs
scancel
– Delete one or more batch jobs
sinfo
– Show info about machines
scontrol
– Show cluster configuration information
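A few typical invocations of these commands (the job ID and partition name are only examples):
squeue -u $USER                   # list your own jobs
scancel 186535                    # cancel job 186535
sinfo -p common                   # show node states in the common partition
scontrol show partition common    # show the common partition's configuration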
Use sbatch
(all lower case) to submit text file job scripts, e.g. test.sh
sbatch test.sh
Use #SBATCH
(upper case) in your scripts for scheduler directives, e.g.
#SBATCH --mem=1G
#SBATCH --output=matlab.out
All SLURM directives can be given on the command line instead of in the script; see slurm.schedmd.com for details.
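For example, the two directives above could also be given at submission time (a sketch, reusing test.sh from earlier):
sbatch --mem=1G --output=matlab.out test.sh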
Slurm memory directives
The default memory request (allocation) is 2 GB RAM. This is a hard limit, so always request a little more than you expect to use. To request a total amount of memory for the job, use one of the following:
* --mem=<MB>
the amount of memory required per node, in megabytes, or
* --mem=<Gigabyte>G
the amount of memory required per node, in gigabytes, or
* --mem-per-cpu=<MB>
the amount of memory per CPU core, for multi-threaded jobs
Note: --mem and --mem-per-cpu are mutually exclusive
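A short sketch of the two styles for a 4-core job (the values are only examples):
# total memory for the whole job:
#SBATCH -c 4
#SBATCH --mem=2G
# or, memory per core:
#SBATCH -c 4
#SBATCH --mem-per-cpu=512   # 512 MB per core x 4 cores = 2 GB total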
Slurm parallel directives
All parallel directives default to 1
-N <number>
How many nodes (machines)
-n <number>
or --ntasks=<number>
How many parallel jobs (“tasks”)
-c <number>
or --cpus-per-task=<number>
How many CPU cores per task
Use -n
and -N
for multi-node jobs (e.g. MPI)
Use -c
(--cpus-per-task) for multi-threaded jobs
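A minimal sketch of the two patterns (directives only):
# MPI-style: 8 tasks spread over 2 nodes
#SBATCH -N 2
#SBATCH -n 8
# multi-threaded: 1 task using 8 CPU cores on one node
#SBATCH -c 8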
Job script examples
#!/bin/bash
#SBATCH --output=test.out
hostname # print hostname
This prints the name of the compute node in the file “test.out”
tm103@dcc-login-02 ~/slurm $ sbatch simple.sh
Submitted batch job 186554
tm103@dcc-login-02 ~/slurm $ cat test.out
dcc-core-14
Long-form commands example
#!/bin/bash
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --mem=100 # 100 MB RAM
#SBATCH --partition=scavenger
hostname 1>&2 # prints hostname to the error file
This job will run in low priority on a lab node in the “scavenger” partition
Short-form commands example
SLURM short commands don't use "=" signs
#!/bin/bash
#SBATCH -o slurm.out
#SBATCH -e slurm.err
#SBATCH --mem=4G # 4 GB RAM
#SBATCH -p scavenger
hostname 1>&2 #prints hostname to the error file
R example script
#!/bin/bash
#SBATCH -e slurm.err
#SBATCH --mem=4G # 4 GB RAM
module load R/3.6.0
R CMD BATCH Rcode.R
This loads the environment module for R/3.6.0 and runs a single R script (“Rcode.R”)
The #SBATCH --mem=4G line requests 4 GB of RAM, more than the 2 GB default.
Multi-threaded (multi-core) example
#!/bin/bash
#SBATCH -J test
#SBATCH -o test.out
#SBATCH -c 4
#SBATCH --mem-per-cpu=500 # (500 MB)
myApplication -n $SLURM_CPUS_PER_TASK
The value of $SLURM_CPUS_PER_TASK is the number after -c. This example starts a single, multi-threaded job that uses 4 CPU cores and 2 GB (4 x 500 MB) of RAM.
OpenMP multicore example
#!/bin/bash
#SBATCH -J openmp-test
#SBATCH -o slurm.out
#SBATCH -c 4
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
myOpenMPapp # will run on 4 CPU cores
This sets $OMP_NUM_THREADS
to the value of $SLURM_CPUS_PER_TASK
Slurm job arrays
Slurm job arrays are a mechanism for submitting and managing collections of similar jobs using one job script and one application program.
- Add the --array (or -a) option to the job script
- Each job task will inherit a SLURM_ARRAY_TASK_ID environment variable with a different integer value
- Each job array can have up to 100,000 job tasks on the DCC
- Job arrays are only supported for batch jobs
- Job array “tasks” must be independent: slurm.schedmd.com/job_array.html
For example, in a job script, add the line #SBATCH --array=1-30
or, alternatively, #SBATCH -a 1-30
to submit 30 job tasks. The job array indices can also be specified on the command line, e.g.
sbatch -a 1-30 myjob.sh
The index values can be continuous, e.g.
-a 0-31 (32 tasks, numbered 0,1,2,…,31)
or discontinuous, e.g.
-a 3,5,7-9,12 (6 tasks, numbered 3,5,7,8,9,12)
It can also be a single job task, e.g.
-a 7
The discontinuous notation is useful for resubmitting specific job tasks that previously failed. Each job task is assigned the environment variable $SLURM_ARRAY_TASK_ID set to its index value.
tm103@dcc-login-02 ~/misc/jobarrays $ cat array-test.sh
#!/bin/bash
echo $SLURM_ARRAY_TASK_ID
tm103@dcc-login-02 ~/misc/jobarrays $ sbatch -a 1-3 array-test.sh
Submitted batch job 24845830
tm103@dcc-login-02 ~/misc/jobarrays $ ls slurm-24845830*
slurm-24845830_1.out slurm-24845830_2.out slurm-24845830_3.out
tm103@dcc-login-02 ~/misc/jobarrays $ cat slurm-24845830*
1
2
3
tm103@dcc-login-02 ~/misc/jobarrays $
Python job array example
#!/bin/bash
#SBATCH -e slurm_%A_%a.err
#SBATCH -o slurm_%A_%a.out
#SBATCH --array=1-5000
python myCode.py
$ cat myCode.py
import os
taskID=int(os.environ['SLURM_ARRAY_TASK_ID'])
...
This starts 5000 Python job tasks, each with a different "taskID" initialized from $SLURM_ARRAY_TASK_ID
Importing Environment Variables
Python:
numCPUs=int(os.environ['SLURM_CPUS_PER_TASK'])
taskID=int(os.environ['SLURM_ARRAY_TASK_ID'])
R:
numCPUs <- as.integer(Sys.getenv("SLURM_CPUS_PER_TASK"))
taskID <- as.integer(Sys.getenv("SLURM_ARRAY_TASK_ID"))
MATLAB:
numCPUs = str2num(getenv('SLURM_CPUS_PER_TASK'))
taskID = str2num(getenv('SLURM_ARRAY_TASK_ID'))
Processing separate input files
Process an existing file list, e.g. files.txt
#!/bin/bash
readarray -t FILES < files.txt
FILENAME=${FILES[(($SLURM_ARRAY_TASK_ID - 1))]}
myapp $FILENAME
Dynamically generate a file list from “ls”
#!/bin/bash
FILES=($(ls -1 myfile*))
FILENAME=${FILES[(($SLURM_ARRAY_TASK_ID - 1))]}
myapp $FILENAME
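An alternative sketch, assuming files.txt has one file name per line: pick out the line matching the task index with sed (myapp is a placeholder):
#!/bin/bash
FILENAME=$(sed -n "${SLURM_ARRAY_TASK_ID}p" files.txt)  # line N for task N, so use -a 1-N
myapp $FILENAME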
Example: Using the taskID as part of the file name and output directory, for the case of input files named input_1.txt, input_2.txt, …, input_N.txt with -a 1-N, e.g.
#!/bin/bash
#SBATCH -e slurm_%A_%a.err
#SBATCH -o slurm_%A_%a.out
mkdir out_${SLURM_ARRAY_TASK_ID}
cd out_${SLURM_ARRAY_TASK_ID}
myapp ../input_${SLURM_ARRAY_TASK_ID}.txt
where output directories out_1, out_2, … are created for input files input_1.txt, input_2.txt, …
“Unrolling” for loops example
Original “serial” code (Python)
fibonacci = [0,1,1,2,3,5,8,13,21]
for i in range(len(fibonacci)):
print(i,fibonacci[i])
Job array version
import os
i=int(os.environ['SLURM_ARRAY_TASK_ID'])
fibonacci = [0,1,1,2,3,5,8,13,21]
#for i in range(len(fibonacci)):
print(i,fibonacci[i])
where the for loop is commented out and each job task performs a single "iteration"
tm103@dcc-login-02 ~/misc/jobarrays $ cat fib-array.sh
#!/bin/bash
#SBATCH -e slurm.err
module load Python/2.7.11
python fibonacci.py
tm103@dcc-login-02 ~/misc/jobarrays $ sbatch -a 1-8 fib-array.sh
Submitted batch job 24856052
tm103@dcc-login-02 ~/misc/jobarrays $ ls slurm-24856052_*
slurm-24856052_1.out slurm-24856052_3.out slurm-24856052_5.out slurm-24856052_7.out
slurm-24856052_2.out slurm-24856052_4.out slurm-24856052_6.out slurm-24856052_8.out
tm103@dcc-login-02 ~/misc/jobarrays $ cat slurm-24856052*
(1, 1)
(2, 1)
(3, 2)
(4, 3)
(5, 5)
(6, 8)
(7, 13)
(8, 21)
tm103@dcc-login-02 ~/misc/jobarrays $
Running MPI jobs
Supported MPI versions are Intel MPI and OpenMPI
Compiling with OpenMPI
tm103@dcc-login-02 ~ $ module load OpenMPI/4.0.5-rhel8
OpenMPI 4.0.5-rhel8
tm103@dcc-login-03 ~ $ mpicc -o openhello hello.c
tm103@dcc-login-02 ~ $ ls -l openhello
-rwxr-xr-x. 1 tm103 scsc 9184 Sep 1 16:08 openhello
OpenMPI job script
#!/bin/bash
#SBATCH -o openhello.out
#SBATCH -e slurm.err
#SBATCH -n 20
module load OpenMPI/4.0.5-rhel8
mpirun -n $SLURM_NTASKS openhello
OpenMPI example output
tm103@dcc-login-02 ~/misc/slurm/openmpi $ cat openhello.out
dcc-core-01, rank 0 out of 20 processors
dcc-core-01, rank 1 out of 20 processors
dcc-core-01, rank 2 out of 20 processors
dcc-core-01, rank 3 out of 20 processors
dcc-core-01, rank 4 out of 20 processors
dcc-core-03, rank 13 out of 20 processors
dcc-core-03, rank 14 out of 20 processors
dcc-core-03, rank 10 out of 20 processors
dcc-core-03, rank 11 out of 20 processors
dcc-core-03, rank 12 out of 20 processors
dcc-core-02, rank 8 out of 20 processors
dcc-core-02, rank 9 out of 20 processors
dcc-core-02, rank 5 out of 20 processors
GPU jobs
To run a GPU batch job, add the job script lines
#SBATCH -p gpu-common
#SBATCH --gres=gpu:1
#SBATCH --exclusive
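Put together, a minimal GPU batch script might look like the sketch below (nvidia-smi stands in for your GPU application):
#!/bin/bash
#SBATCH -p gpu-common
#SBATCH --gres=gpu:1
#SBATCH --exclusive
#SBATCH -o gpu-test.out
nvidia-smi   # reports the GPU allocated to this job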
To get an interactive GPU node session, type the command line srun -p gpu-common --gres=gpu:1 --pty bash -i
tm103@dcc-login-02 ~ $ srun -p gpu-common --gres=gpu:1 --pty bash -i
tm103@dcc-gpu-01 ~ $ /usr/local/cuda-7.5/samples/1_Utilities/deviceQuery/deviceQuery
...
Detected 1 CUDA Capable device(s)
Device 0: "Tesla K80"
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 3.7
Total amount of global memory: 11520 MBytes (12079136768 bytes)
(13) Multiprocessors, (192) CUDA Cores/MP: 2496 CUDA Cores
NEW: GPU types (coming 10/1/2022)
Request a specific GPU type and number of GPU devices with
#SBATCH --gres=gpu:(GPU type):(GPU number)
where "GPU number" is a number from 1 to 4 and "GPU type" is one of the types listed below.
RTX6000
TITANX
P100
GTX1080
V100 (16 GB GPU RAM)
V10032 (32 GB GPU RAM)
RTXA5000
RTX2080 (Currently all core-gpu nodes and the vast majority of scavenger-gpu nodes are these.)
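For example, a sketch requesting one GPU of a specific type, in a job script and interactively (the type and count are illustrative):
#SBATCH --gres=gpu:RTX2080:1
srun -p gpu-common --gres=gpu:RTX2080:1 --pty bash -i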
Job dependencies
Submit a job that waits for another job to finish.
$ sbatch dep1.q
Submitted batch job 666898
Make a note of the assigned job ID of dep1
$ sbatch --dependency=afterok:666898 dep2.q
Job dep2 will not start until dep1 finishes successfully (afterok)
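The job ID can also be captured automatically with sbatch's --parsable option, which prints just the job ID; a sketch:
JOBID=$(sbatch --parsable dep1.q)          # submit dep1.q, keep its job ID
sbatch --dependency=afterok:$JOBID dep2.q  # dep2.q starts only if dep1.q succeeds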
Job dependencies with arrays
Wait for specific job array elements
sbatch --depend=after:123_4 my.job
sbatch --depend=afterok:123_4:123_8 my.job2
Wait for entire job array to complete
sbatch --depend=afterany:123 my.job
Wait for entire job array to complete successfully
sbatch --depend=afterok:123 my.job
Wait for entire job array to complete with at least one failed task
sbatch --depend=afternotok:123 my.job
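For example, a sketch that submits a job array and a follow-up job that runs only after all array tasks succeed (script names are placeholders):
ARRAYID=$(sbatch --parsable -a 1-30 array-step.sh)   # submit the 30-task array
sbatch --depend=afterok:$ARRAYID merge-step.sh       # runs after every task succeeds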
More information: hcc-docs.unl.edu/display/HCCDOC/Job+Dependencies