Note: this document is meant to be accurate "to a first approximation", so as not to unnecessarily confuse readers with more complicated usage scenarios, edge cases and so on.

Will this document tell me everything I need to know?

NO. The purpose of this FAQ is to give Maths users who have not worked with Slurm before, or indeed with any job scheduler, the bare minimum of information needed to get going. For further education on the subject we highly recommend the relevant UIS course, the official "Introduction to Slurm" tutorial, and the Slurm user documentation.

Why should I use Slurm? And what is it anyway?

As mentioned above, Slurm is "a job scheduler". Unpacking this into simpler terms: given a computational job to run, Slurm will check whether the requested resources are available, and either put the job on hold until they are or allocate them, dispatch the job, monitor its execution, and clean up afterwards. The "putting on hold" part also takes care of assigning priorities to the scheduled jobs, depending on the amount of resources requested, the user's share in the system, how long the jobs have already been sitting in the queue and so on.

So, what are the advantages of using Slurm over simply connecting to the target machine over SSH and running your jobs by hand?

  • once your job has been allocated resources, they are its own for the duration of its execution - regardless of whether the target system is simultaneously used by five or five hundred users;
  • no need to keep an open terminal session, and/or to employ various "survive disconnection" tricks such as a terminal multiplexer, for as long as your job is running;
  • better use of computing resources outside working hours - queued jobs get started as soon as they can acquire the requested resources, regardless of the time of day, day of the week and so on.

How do I make my job use Slurm?

In the simplest case, all you have to do is prefix the name of your program or script with "srun ". That's it! Consider the following example, in which our "job" is a command printing the name of the host it runs on:

mn01:~> hostname
mn01
mn01:~> srun hostname
srun: job 1234567 queued and waiting for resources
srun: job 1234567 has been allocated resources
cosmosx

Sounds great, but will the simplest case work for me?

In quite a lot of cases it might, with one major caveat - calling srun with no additional parameters causes your job to rely entirely on the defaults set by the cluster administrators. In particular, it submits your job to the default Slurm partition - a subset of cluster resources which may or may not include the host on which you want your job to run.
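To find out which partitions exist on your cluster and which of them is the default (marked with an asterisk), you can use the command sinfo. A summary listing might look something like the following, with the partition and node names made up for illustration:

mn01:~> sinfo -s
PARTITION AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
compute*     up 7-00:00:00        10/2/0/12  mn[01-12]
gpu          up 2-00:00:00          3/1/0/4  gpu[01-04]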

In order to specify the partition to run on, use the srun parameter "-p". For instance, assuming the cluster has got a non-default partition called "gpu" comprising servers capable of performing calculations on GPUs, you would (leaving aside the subject of actually requesting GPUs for your job, for simplicity) run something like

srun -p gpu my_gpu_benchmark

On top of the coarse resource selection associated with choosing a partition, Slurm also allows the user to request specific amounts of particular resources: the number of CPU threads per node, the number of nodes, the amount of RAM, the estimated run time (which, if shorter than the maximum allowed for the partition, might improve your chances of earlier execution) and many others. Consult the documentation for details.
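For instance, a hypothetical invocation requesting 4 CPU threads, 8 GB of RAM and at most 30 minutes of run time from the "gpu" partition could look like this:

srun -p gpu --cpus-per-task=4 --mem=8G --time=00:30:00 my_gpu_benchmark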

"srun myjob" remains attached to my terminal, wasn't I supposed to avoid that with Slurm?

Yes and no - in order to have your jobs run detached you must combine srun with another Slurm command, sbatch. The basic procedure is as follows:

  • write a shell script (a "batch script") containing your invocation of srun myjob (yes, you do still need srun in this case) as well as any preparatory steps it may need, such as loading the appropriate environment modules;
  • for all the srun arguments specifying the partition to use, the required resources and so on, add lines near the top of your script which start with #SBATCH followed by the respective arguments, then remove said arguments from the srun line;
  • submit your job with "sbatch myjobscript".
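To illustrate, here is a minimal sketch of such a batch script; the job name, partition, module and program are all placeholders which you will have to adapt to your cluster:

#!/bin/bash
#SBATCH -J myjob                # name of the job
#SBATCH -p gpu                  # partition to submit to
#SBATCH --time=01:00:00         # estimated run time
module load my_environment      # preparatory steps, if any, go here
srun my_gpu_benchmark           # note that srun is still used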

For a reasonably comprehensive, annotated example of a Slurm batch script, see the end of this FAQ.

I have submitted a batch job, how can I tell what its status is?

You can use the command squeue to find out which of your jobs have already been started and which are still queuing for resources. It can also tell you which jobs are ahead of yours in a given partition's queue; keep in mind, however, that a job queued further down the list but requesting fewer resources than the one(s) on top might get started earlier if the alternative were for some of the cluster resources to remain idle for a significant amount of time.
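For instance, to list only your own jobs rather than the entire queue, you can run something like the following (the output shown here is made up):

mn01:~> squeue -u $USER
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           1234567       gpu    myjob    spqr1  R       5:23      1 gpu01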

Furthermore, by default Slurm captures the console output of a job to a file named slurm-<jobid>.out, located in the directory from which the batch script was submitted (when in doubt, run the command scontrol show job <jobid> and look at the keys StdErr and StdOut). This file can be examined both after the job has finished and while it is still running.
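For example, assuming a job ID of 1234567:

mn01:~> scontrol show job 1234567 | grep -E 'StdOut|StdErr'
mn01:~> tail -f slurm-1234567.out

The second command follows the output file as it is being written; press Ctrl-C to stop watching.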

Finally, Slurm jobs can be configured so that specific events in the life cycle of a job are communicated by e-mail; see the documentation of the command sbatch for details.
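For instance, adding the following directives to a batch script should result in an e-mail being sent when the job ends or fails; the address is, of course, a placeholder:

#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=spqr1@example.com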

I would like my multithreaded jobs to actually run their threads in parallel

Simply give your job the argument "--cpus-per-task=X", where X is the number of threads you want to be able to be active at the same time. Note, however, that unlike most other job arguments it must be given to both srun and sbatch, i.e. you do not remove it from the srun command line when you call it from a batch script, even though you do have to pass it to sbatch as well.
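In a batch script this translates to something like the following, with the thread count of 8 being purely illustrative:

#SBATCH --cpus-per-task=8
srun --cpus-per-task=$SLURM_CPUS_PER_TASK my_threaded_program

Referring to the environment variable SLURM_CPUS_PER_TASK, which Slurm sets from the value passed to sbatch, saves you from having to keep the two numbers in sync by hand.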

Another commonly encountered approach is to give your job the arguments "--nodes=1 --ntasks=X", taking advantage of the fact that in Slurm parlance a task is simply something that occupies a CPU, regardless of whether the tasks are spawned by multiple executables or a single one. The advantage of this approach is that these arguments do not have to be passed to both srun and sbatch; on the other hand, it can result in less optimal scheduling of your job by Slurm and can cause unexpected behaviour under some circumstances - not least when you forget to limit your tasks to a single node! Avoid this approach; it is mentioned here only because it turns up quite often in Web answers to this question.

How do I run MPI jobs under Slurm?

This is mostly beyond the scope of this FAQ, especially given that the details vary between different MPI implementations. There are, however, two points worth keeping in mind which apply to all of the more common implementations:

  • Slurm will generally take care of letting the MPI processes in a multi-node job know where their peers are, and might help them set up appropriate communication channels;
  • although it might still work for some MPI implementations, it is generally recommended NOT to use mpiexec or mpirun to launch MPI jobs under Slurm. Use srun instead (as sketched below), possibly with the option "--mpi" to select the right integration mode if the system default does not work for you; you can see the list of modes available on your cluster by running srun --mpi=list.
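For instance, in a batch script which has already requested 2 nodes and 64 tasks via #SBATCH directives, the launch line could be as simple as the following, with the program name made up and the "--mpi" option only needed if the default integration mode does not work for you:

srun --mpi=pmi2 ./my_mpi_program

srun picks up the node and task counts from the job's allocation, so they do not need to be repeated here.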

For more details regarding MPI and Slurm, including instructions for configuring different popular MPI implementations, see the Slurm MPI and UPC Users Guide.

Appendix A: Summary of useful Slurm commands

Many of these have not been discussed above but you might find them handy at some point. Consult their respective manual pages for details.

squeue      - show the state of the job queue
sinfo       - show the state of partitions and nodes
sview       - graphical overview of jobs, partitions and nodes
scontrol show job nnnn - examine the job with job ID nnnn
scontrol show node nodename - examine the node with name nodename
sbatch      - submit an executable script to the queueing system
srun        - run a command either as a new job or within an existing job
scancel     - cancel a job, whether queued or running

Appendix B: Example Slurm batch script

This is an example Slurm batch script from the CSD3 cluster at UIS, which might be a good starting point for writing your own scripts of this sort. Do keep in mind that cluster-specific settings, such as the names of Slurm partitions or of the environment modules to load, are pretty much guaranteed not to work as-is; if you use MPI, also remember the "use srun rather than mpirun if possible" advice given above.

#!/bin/bash
#!
#! Example SLURM job script for Peta4-Skylake (Skylake CPUs, OPA)
#! Last updated: Mon 13 Nov 12:25:17 GMT 2017
#!

#!#############################################################
#!#### Modify the options in this section as appropriate ######
#!#############################################################

#! sbatch directives begin here ###############################
#! Name of the job:
#SBATCH -J cpujob
#! Which project should be charged:
#SBATCH -A CHANGEME
#! How many whole nodes should be allocated?
#SBATCH --nodes=2
#! How many (MPI) tasks will there be in total? (<= nodes*32)
#! The Peta4-Skylake nodes have 32 CPUs (cores) each.
#SBATCH --ntasks=64
#! Each task is allocated 1 core by default, and each core is allocated 5980MB (skylake)
#! or 12030MB (skylake-himem). If this is insufficient, also specify
#! --cpus-per-task and/or --mem (the latter specifies MB per node).
#! How much wallclock time will be required?
#SBATCH --time=02:00:00
#! For 6GB per CPU, set "-p skylake"; for 12GB per CPU, set "-p skylake-himem":
#SBATCH -p skylake

#! sbatch directives end here (put any additional directives above this line)

#! Number of nodes and tasks per node allocated by SLURM (do not change):
numnodes=$SLURM_JOB_NUM_NODES
numtasks=$SLURM_NTASKS
mpi_tasks_per_node=$(echo "$SLURM_TASKS_PER_NODE" | sed -e  's/^\([0-9][0-9]*\).*$/\1/')
#! ############################################################
#! Modify the settings below to specify the application's environment, location
#! and launch method:

#! Optionally modify the environment seen by the application
#! (note that SLURM reproduces the environment at submission irrespective of ~/.bashrc):
. /etc/profile.d/modules.sh                # Leave this line (enables the module command)
module purge                               # Removes all modules still loaded
module load rhel7/default-peta4            # REQUIRED - loads the basic environment

#! Insert additional module load commands after this line if needed:

#! Full path to application executable:
application=""

#! Run options for the application:
options=""

#! Work directory (i.e. where the job will run):
workdir="$SLURM_SUBMIT_DIR"  # The value of SLURM_SUBMIT_DIR sets workdir to the directory
                             # in which sbatch is run.

#! Are you using OpenMP (NB this is unrelated to OpenMPI)? If so increase this
#! safe value to no more than 32:
export OMP_NUM_THREADS=1

#! Number of MPI tasks to be started by the application per node and in total (do not change):
np=$(( numnodes * mpi_tasks_per_node ))

#! The following variables define a sensible pinning strategy for Intel MPI tasks -
#! this should be suitable for both pure MPI and hybrid MPI/OpenMP jobs:
export I_MPI_PIN_DOMAIN=omp:compact # Domains are $OMP_NUM_THREADS cores in size
export I_MPI_PIN_ORDER=scatter # Adjacent domains have minimal sharing of caches/sockets
#! Notes:
#! 1. These variables influence Intel MPI only.
#! 2. Domains are non-overlapping sets of cores which map 1-1 to MPI tasks.
#! 3. I_MPI_PIN_PROCESSOR_LIST is ignored if I_MPI_PIN_DOMAIN is set.
#! 4. If MPI tasks perform better when sharing caches/sockets, try I_MPI_PIN_ORDER=compact.


#! Uncomment one choice for CMD below (add mpirun/mpiexec options if necessary):

#! Choose this for a MPI code (possibly using OpenMP) using Intel MPI.
CMD="mpirun -ppn $mpi_tasks_per_node -np $np $application $options"

#! Choose this for a pure shared-memory OpenMP parallel program on a single node:
#! (OMP_NUM_THREADS threads will be created):
#CMD="$application $options"

#! Choose this for a MPI code (possibly using OpenMP) using OpenMPI:
#CMD="mpirun -npernode $mpi_tasks_per_node -np $np $application $options"


###############################################################
### You should not have to change anything below this line ####
###############################################################

cd $workdir
echo -e "Changed directory to `pwd`.\n"

JOBID=$SLURM_JOB_ID

echo -e "JobID: $JOBID\n======"
echo "Time: `date`"
echo "Running on master node: `hostname`"
echo "Current directory: `pwd`"

if [ "$SLURM_JOB_NODELIST" ]; then
        #! Create a machine file:
        export NODEFILE=`generate_pbs_nodefile`
        cat $NODEFILE | uniq > machine.file.$JOBID
        echo -e "\nNodes allocated:\n================"
        echo `cat machine.file.$JOBID | sed -e 's/\..*$//g'`
fi

echo -e "\nnumtasks=$numtasks, numnodes=$numnodes, mpi_tasks_per_node=$mpi_tasks_per_node (OMP_NUM_THREADS=$OMP_NUM_THREADS)"

echo -e "\nExecuting command:\n==================\n$CMD\n"

eval $CMD