SLURM Workload Manager

Running Jobs
General
The Simple Linux Utility for Resource Management (SLURM) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.
Architecture
The entities managed by the SLURM daemons include nodes (the compute resources in SLURM), partitions (which group nodes into logical, possibly overlapping, sets), jobs (allocations of resources assigned to a user for a specified amount of time), and job steps (sets of, possibly parallel, tasks within a job). Partitions can be considered job queues, each of which has an assortment of constraints such as job size limit, job time limit, users permitted to use it, etc.
SLURM Commands
These are the SLURM commands frequently used on DC3:
sinfo -p aegir is used to show the state of partitions and nodes managed by SLURM:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
aegir up 1-00:00:00 28 alloc node[164-172,174-180,441-452]
aegir up 1-00:00:00 1 idle node173
This shows that there are 29 nodes available (up) in the aegir partition: 28 of them are occupied (alloc) and 1 is free (idle), with a maximum runtime per job (TIMELIMIT) of 24 hours. Nodes 164-180 have 16 CPU cores each, and nodes 441-452 have 32 CPU cores each.
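sinfo can also produce a node-oriented listing, one line per node, which makes it easy to see exactly which nodes are idle; for example (the exact columns may vary slightly with the SLURM version):
sinfo -p aegir -N -l
Here -N switches to node-oriented output and -l (long) adds per-node details such as CPUs, memory and state.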
To see the detailed configuration of a partition, use:
scontrol show partition aegir
PartitionName=aegir
AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=NO QoS=N/A
DefaultTime=NONE DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=node[164-180,441-452]
PriorityJobFactor=320 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=1312 TotalNodes=29 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
This output shows us in detail that:
Anyone can submit a job to the aegir partition (AllowGroups=ALL)
The walltime limit on the aegir partition is 1 day (MaxTime=1-00:00:00)
It is important to understand that the TotalCPUs=1312 figure counts logical CPUs, i.e. hardware threads (2 per core), across all nodes of the aegir partition, not physical cores.
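As a sanity check, this matches the node inventory above: nodes 164-180 are 17 nodes with 16 cores × 2 threads = 32 logical CPUs each, and nodes 441-452 are 12 nodes with 32 cores × 2 threads = 64 logical CPUs each, so TotalCPUs = 17 × 32 + 12 × 64 = 544 + 768 = 1312.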
scontrol show Node=node164 shows information about node164:
NodeName=node164 Arch=x86_64 CoresPerSocket=8
CPUAlloc=32 CPUTot=32 CPULoad=11.90
AvailableFeatures=v1
ActiveFeatures=v1
Gres=(null)
NodeAddr=node164 NodeHostName=node164 Version=18.08
OS=Linux 3.10.0-957.12.2.el7.x86_64 #1 SMP Tue May 14 21:24:32 UTC 2019
RealMemory=64136 AllocMem=16384 FreeMem=57967 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=aegir,bylgja
BootTime=2019-07-29T21:52:54 SlurmdStartTime=2020-01-22T18:09:02
CfgTRES=cpu=32,mem=64136M,billing=32
AllocTRES=cpu=32,mem=16G
CapWatts=n/a
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
This output, which is edited for length, shows us a number of things:
Node 164 currently has all of its CPUs allocated to jobs (CPUAlloc=32). The output also shows that there are 2 threads per core (ThreadsPerCore) and 32 logical CPUs in total (CPUTot = Sockets × CoresPerSocket × ThreadsPerCore = 2 × 8 × 2), as well as the amount of memory on the node (RealMemory, in MB), the configured temporary disk space (TmpDisk), etc.
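If only a few of these fields are of interest, the record can be filtered in the shell, for example:
scontrol show Node=node164 | grep -E 'CPUAlloc|RealMemory|State'
This prints just the lines containing the allocation, memory and state information.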
The squeue -p aegir command is used to show the jobs in the queueing system. It gives output similar to this:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
16310118 aegir cesmi6ga guido PD 0:00 2 (Resources)
16306731 aegir s40flT0 jlohmann R 1:36:53 1 node172
16309537 aegir i6gat31i nutrik R 2:18:33 5 node[441,443-446]
16307317 aegir Ctrl2dFl jlohmann R 4:56:44 1 node442
16306131 aegir RTIPw1 jlohmann R 5:35:45 1 node164
16305493 aegir S2d360 jlohmann R 7:29:17 1 node452
16303418 aegir cesmi6ga guido R 9:53:55 4 node[166,177-179]
16301838 aegir pit31drb nutrik R 11:33:27 5 node[447-451]
16299026 aegir cesmi6ga guido R 15:32:06 2 node[170-171]
16298749 aegir cesmi6ga guido R 16:11:26 2 node[165,176]
16297229 aegir cesmi6ga guido R 18:33:01 2 node[168-169]
This partial output shows us that:
nutrik is running two different jobs in the aegir partition, on nodes [441,443-446] and [447-451].
guido is currently queueing in the aegir partition with his job cesmi6ga, waiting for free nodes (state PD, with reason Resources).
More generally, the output shows us the following:
The first column is the JOBID, which is needed to cancel or modify the job.
The second column is the partition the job is running in.
The third column is the job name.
The fourth column is the username of the person who submitted the job.
The fifth column is the state of the job. Some of the possible job states are as follows:
PD (pending), R (running), CA (cancelled), CF (configuring), CG (completing), CD (completed), F (failed), TO (timeout), NF (node failure) and SE (special exit state).
The sixth column is the job runtime.
The seventh and eighth columns are the number of allocated nodes and the list of nodes the job is running on. A couple of useful filters are shown below.
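To restrict the listing to your own jobs, or to follow a single job, squeue accepts user and job filters, for example (the job ID is taken from the listing above):
squeue -u $USER
squeue -j 16310118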
sbatch $BATCH_FILE is used to submit a job script for execution. The script will typically contain one or more srun commands to launch parallel tasks.
scancel $JOBID is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.
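For example, using a job ID from the squeue listing above:
scancel 16310118
scancel --signal=USR1 16310118
The first form cancels the job; the second sends SIGUSR1 to all processes of the job instead of terminating it.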
SLURM example script:
#!/bin/sh
#
#SBATCH -p aegir
#SBATCH -A ocean
#SBATCH --job-name=myjob
#SBATCH --time=00:30:00
#SBATCH --constraint=v1
#SBATCH --nodes=2
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=1
#SBATCH --exclusive
#SBATCH --mail-type=ALL
#SBATCH --mail-user=mymail@nbi.ku.dk
#SBATCH --output=slurm.out
srun --mpi=pmi2 --kill-on-bad-exit my_program.exe
Then submit the script: sbatch ./my_batch_script.sh
In this example we use the aegir partition to run my_program.exe, set our job name, request 30 minutes of runtime, restrict the job to nodes with 16 cores (--constraint=v1), request 2 nodes and 32 tasks (one task per core), disallow sharing of node resources (--exclusive), enable e-mail notifications, and define the file name for the standard job output. One can instead request a node with 32 cores (--constraint=v2); in that case the SLURM batch script looks like this:
#!/bin/sh
#
#SBATCH -p aegir
#SBATCH -A ocean
#SBATCH --job-name=myjob
#SBATCH --time=00:30:00
#SBATCH --constraint=v2
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=1
#SBATCH --exclusive
#SBATCH --mail-type=ALL
#SBATCH --mail-user=mymail@nbi.ku.dk
#SBATCH --output=slurm.out
srun --mpi=pmi2 --kill-on-bad-exit my_program.exe
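After submission, sbatch prints the assigned job ID (Submitted batch job $JOBID). The job can then be monitored with the commands described above, e.g.:
squeue -j $JOBID
scontrol show job $JOBID
Once the job starts, its standard output is written to slurm.out, as requested with --output.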
Mass Storage
Quota
There are no individual user quotas, only a group quota with a limited amount of space, which is enforced by the file system. If this limit is exceeded, the whole group will be unable to write new data.
You can check the current use with: lfs quota -h /lustre/hpc
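To check the quota of a specific group (mygroup is a placeholder here; your own groups can be listed with the groups command):
lfs quota -h -g mygroup /lustre/hpc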
More information on mass storage and the workload manager can be found at https://hpc.ku.dk.