SLURM Workload Manager

Running Jobs

General

The Simple Linux Utility for Resource Management (SLURM) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.

Architecture

The entities managed by the SLURM daemons include nodes (the compute resource in SLURM), partitions (which group nodes into logical, possibly overlapping, sets), jobs (allocations of resources assigned to a user for a specified amount of time), and job steps (sets of, possibly parallel, tasks within a job). Partitions can be considered job queues, each of which has an assortment of constraints such as job size limit, job time limit, users permitted to use it, etc.
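For illustration, the relationship between a job and its job steps can be sketched as a batch script: the sbatch allocation is the job, and every srun launched inside it is a job step. This is a minimal, hypothetical example (the program names are placeholders; the aegir partition is the one used in the examples below):

#!/bin/sh
#SBATCH -p aegir
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --time=00:10:00

# Each srun inside the allocation becomes a separate job step,
# executed one after the other within the same job.
srun ./preprocess.exe
srun ./my_program.exe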

SLURM Commands

These are the SLURM commands frequently used on DC3:

sinfo -p aegir is used to show the state of partitions and nodes managed by SLURM:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
aegir        up 1-00:00:00     28  alloc node[164-172,174-180,441-452]
aegir        up 1-00:00:00      1   idle node173

This shows that there are 29 nodes available (up) in the aegir partition: 28 of them are occupied (alloc) and 1 is free (idle), with a maximum runtime per job (TIMELIMIT) of 24 hours. Nodes node164-node180 have 16 CPU cores per node, and nodes node441-node452 have 32 CPU cores per node.
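For a per-node view of the partition, sinfo can also be given an output format string; a small sketch using standard sinfo options (the exact output depends on the current cluster state):

sinfo -p aegir -N -o "%N %c %m %t"    # node name, CPUs per node, memory (MB), node state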

To see the detailed settings of a partition, use (omitting the partition name lists all partitions):

scontrol show partition aegir

PartitionName=aegir
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO QoS=N/A
   DefaultTime=NONE DisableRootJobs=YES ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=1-00:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
   Nodes=node[164-180,441-452]
   PriorityJobFactor=320 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=1312 TotalNodes=29 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

This output shows us in detail that:
Anyone can submit a job to the aegir partition (AllowGroups=ALL). The walltime limit on the aegir partition is 1 day (MaxTime=1-00:00:00). It is important to understand that the TotalCPUs=1312 figure counts hardware threads (2 per core) rather than physical cores: the partition's 17 nodes with 16 cores each and 12 nodes with 32 cores each give 656 cores, i.e. 1312 threads available in the aegir partition.

scontrol show Node=node164 shows information about node164

NodeName=node164 Arch=x86_64 CoresPerSocket=8 
   CPUAlloc=32 CPUTot=32 CPULoad=11.90
   AvailableFeatures=v1
   ActiveFeatures=v1
   Gres=(null)
   NodeAddr=node164 NodeHostName=node164 Version=18.08
   OS=Linux 3.10.0-957.12.2.el7.x86_64 #1 SMP Tue May 14 21:24:32 UTC 2019 
   RealMemory=64136 AllocMem=16384 FreeMem=57967 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=aegir,bylgja 
   BootTime=2019-07-29T21:52:54 SlurmdStartTime=2020-01-22T18:09:02
   CfgTRES=cpu=32,mem=64136M,billing=32
   AllocTRES=cpu=32,mem=16G
   CapWatts=n/a
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

This output, which is edited for length, shows us a number of things:
Node164 currently has a job that occupies all of its CPUs (CPUAlloc=32). The node has 2 sockets with 8 cores each (Sockets, CoresPerSocket) and 2 threads per core (ThreadsPerCore), i.e. 16 physical cores and 32 logical CPUs (CPUTot). The output also shows the amount of memory on the node (RealMemory, in MB), the configured temporary disk space (TmpDisk), etc.
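scontrol also accepts node-range expressions, so several nodes can be inspected at once; for example (standard scontrol and grep usage, shown here as a sketch):

scontrol show node "node[164-166]" | grep -E "NodeName|CPUAlloc|FreeMem"    # quick overview of allocation, load and free memory on a few nodes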

The squeue -p aegir command is used to show the jobs in the queueing system. The command gives output similar to this:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
          16310118     aegir cesmi6ga    guido PD       0:00      2 (Resources)
          16306731     aegir  s40flT0 jlohmann  R    1:36:53      1 node172
          16309537     aegir i6gat31i   nutrik  R    2:18:33      5 node[441,443-446]
          16307317     aegir Ctrl2dFl jlohmann  R    4:56:44      1 node442
          16306131     aegir   RTIPw1 jlohmann  R    5:35:45      1 node164
          16305493     aegir   S2d360 jlohmann  R    7:29:17      1 node452
          16303418     aegir cesmi6ga    guido  R    9:53:55      4 node[166,177-179]
          16301838     aegir pit31drb   nutrik  R   11:33:27      5 node[447-451]
          16299026     aegir cesmi6ga    guido  R   15:32:06      2 node[170-171]
          16298749     aegir cesmi6ga    guido  R   16:11:26      2 node[165,176]
          16297229     aegir cesmi6ga    guido  R   18:33:01      2 node[168-169]

This partial output shows us that:
nutrik is running two different jobs in the aegir partition, on nodes [441,443-446] and [447-451]. guido is currently queueing in the aegir partition with his job cesmi6ga, waiting for free nodes (PD).

More generally, the output shows us the following:
The first column is the JOBID, which is used for termination or modification of the job. The second column is the partition the job is running in. The third column is the job name. The fourth column is the user name of the person who submitted the job. The fifth column is the state of the job; some of the possible job states are PD (pending), R (running), CA (cancelled), CF (configuring), CG (completing), CD (completed), F (failed), TO (timeout), NF (node failure) and SE (special exit state). The sixth column is the job runtime. The seventh and eighth columns are the number of allocated nodes and the list of nodes the job is running on.
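Two squeue variants that are often convenient (standard squeue options, shown as a sketch):

squeue -u $USER            # show only your own jobs
squeue -p aegir --start    # show estimated start times of pending jobs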

sbatch $BATCH_FILE is used to submit a job script for execution. The script will typically contain one or more srun commands to launch parallel tasks.
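Options passed on the sbatch command line override the corresponding #SBATCH directives inside the script, so the same script can be reused with, for instance, a different runtime or job name (a sketch; the script name matches the example further below):

sbatch --time=01:00:00 --job-name=test2 ./my_batch_script.sh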

scancel $JOBID is used to cancel a pending or running job or job step. It can also be used to send an arbitrary signal to all processes associated with a running job or job step.
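A few scancel examples (the first job ID is taken from the squeue output above; -u and --signal are standard scancel options):

scancel 16310118                    # cancel a single job by its JOBID
scancel -u $USER -p aegir           # cancel all of your own jobs in the aegir partition
scancel --signal=USR1 16306731      # send SIGUSR1 to a running job instead of terminating it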

SLURM example script:

#!/bin/sh
#
#SBATCH -p aegir
#SBATCH -A ocean
#SBATCH --job-name=myjob
#SBATCH --time=00:30:00
#SBATCH --constraint=v1
#SBATCH --nodes=2
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=1
#SBATCH --exclusive
#SBATCH --mail-type=ALL
#SBATCH --mail-user=mymail@nbi.ku.dk
#SBATCH --output=slurm.out

srun --mpi=pmi2 --kill-on-bad-exit my_program.exe

Then submit the script: sbatch ./my_batch_script.sh

In this example we use the aegir partition to run my_program.exe, set the job name, request 30 minutes of runtime on nodes with 16 cores (--constraint=v1), 2 nodes and 32 tasks (one task per core), exclusive use of the allocated nodes, e-mail notifications, and define the file name for the standard job output. One can instead request a node with 32 cores (--constraint=v2); in that case the SLURM batch script looks like this:

#!/bin/sh
#
#SBATCH -p aegir
#SBATCH -A ocean
#SBATCH --job-name=myjob
#SBATCH --time=00:30:00
#SBATCH --constraint=v2
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --cpus-per-task=1
#SBATCH --exclusive
#SBATCH --mail-type=ALL
#SBATCH --mail-user=mymail@nbi.ku.dk
#SBATCH --output=slurm.out

srun --mpi=pmi2 --kill-on-bad-exit my_program.exe
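After submission, sbatch prints the assigned job ID ("Submitted batch job <jobid>"), which can be used to monitor the job; for example (standard SLURM commands; sacct requires job accounting to be enabled, and <jobid> is a placeholder):

squeue -j <jobid>                                  # queue status of this job
scontrol show job <jobid>                          # full job details while queued or running
sacct -j <jobid> --format=JobID,State,Elapsed      # accounting information, also after the job has finished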

Mass Storage

Quota

There is no individual user quota, but each group has a quota with a limited amount of space which is enforced by the file system. If this limit is exceeded, the whole group will be unable to write new data.

You can check the current use with: lfs quota -h /lustre/hpc
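Since the quota applies per group, the usage of a specific group can also be queried (the group name below is a placeholder; -g is the standard lfs quota option for group quotas):

lfs quota -h -g mygroup /lustre/hpc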

More information on mass storage and workload manager can be found here: https://hpc.ku.dk