Slurm Job Scheduler
CCI uses Slurm to allocate resources for compute jobs. Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. A full description of the software can be found at https://slurm.schedmd.com/overview.html. We use cons_tres as our resource selection algorithm.
All compute jobs must be submitted to the Slurm queue and not run directly on the frontend node. Users running jobs on the frontend node will be locked out of the system.
- Slurm Documentation: https://slurm.schedmd.com/archive/slurm-23.11.0/
- Slurm FAQ in the official Slurm documentation: http://www.schedmd.com/slurmdocs/faq.html
Quick Reference¶
Commands¶
- squeue lists your jobs in the queue
- sinfo lists the state of all machines in the cluster
- sbatch submits batch jobs
- sprio lists the relative priorities of pending jobs in the queue and how they are calculated
- sacct display accounting and submission data for jobs
- scancel is used to cancel jobs
- salloc allocates a node for interactive use
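Typical invocations of these commands look like the following (the job ID and script name are placeholders):

```shell
sbatch myjob.sh        # submit a batch script; prints the new job ID
squeue -u $USER        # show only your own jobs in the queue
sinfo                  # show the state of nodes and partitions
sacct -j 123456        # accounting data for job 123456
scancel 123456         # cancel job 123456
```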
Notes¶
Exclusive Flag¶
- Do not use the --exclusive or --exclusive=1 parameters. They make non-obvious changes to your job behind the scenes that are equivalent to requesting all the resources on a node. If you request all the GPUs on a node, you already have exclusive access. If you need fewer resources but still want exclusivity, please use --exclusive=user instead; that allows other jobs from your own account to use the remaining idle GPUs.
Watching Slurm commands¶
When using watch to monitor the output of a Slurm command, please make sure to set an update interval of at least 60 seconds, e.g. watch -n 120 squeue. The output is unlikely to change any faster than this, and updating too frequently can cause undue load on the scheduler.
Resource specification¶
Options of interest (see the manual page for sbatch for a complete list):
-n, --ntasks=ntasks number of tasks to run (-n 1 is the default and is better left unspecified)
-N, --nodes=N number of nodes on which to run (N = min[-max])
--gres=<list> number of resources needed. In our case GPU or NVMe storage (--gres=gpu:1)
-c, --cpus-per-task=ncpus number of cpus required per task
--ntasks-per-node=n number of tasks to invoke on each node
--cpus-per-gpu=ncpus number of cpus required per allocated GPU
-i, --input=in file for batch script's standard input
-o, --output=out file for batch script's standard output
-e, --error=err file for batch script's standard error
-p, --partition=partition partition requested
-t, --time=minutes time limit you expect your job to complete within
-D, --chdir=directory set working directory for the batch script (named --workdir in older Slurm releases)
--mail-type=type notify on state change: BEGIN, END, FAIL or ALL
--mail-user=user who to send email notification for job state changes
Note that any of the above can be specified in a batch file by preceding the option with #SBATCH. All options defined this way must appear at the top of the batch file, with nothing separating them. For example, the following will send the job's output to a file called joboutput.<the job's ID>:
#SBATCH -o joboutput.%J
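For example, a complete batch-script header combining several of the options above might look like this (the partition name, time limit, and email address are illustrative):

```shell
#!/bin/bash
#SBATCH -p cluster               # partition to run in
#SBATCH -N 1                     # number of nodes
#SBATCH --ntasks-per-node=16     # tasks per node
#SBATCH -t 240                   # time limit in minutes
#SBATCH -o joboutput.%J          # STDOUT file; %J expands to the job ID
#SBATCH --mail-type=ALL          # email on state changes
#SBATCH --mail-user=example@rpi.edu

srun ./your_application          # placeholder binary name
```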
Available QOS's¶
Slurm allows a QOS to be requested to modify job limits and priorities. CCI has several QOSes that can be used by specifying --qos=<QOS name>.
QOS | Usage | Availability | Time Limit | Other Limits |
---|---|---|---|---|
interactive | Fast access to a single GPU for debugging | All users and clusters | 30 minutes | 1 job, 1 GPU |
dcs-48hr | Jobs that cannot use checkpointing to run within 6 hours | All users on DCS | 48 hours | 2 jobs, 36 nodes |
npl-48hr | Jobs that cannot use checkpointing to run within 6 hours | By request on NPL | 48 hours | 1 job, 4 nodes |
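A QOS can be requested on the command line or as a directive in the batch script itself; for example (the script name is a placeholder):

```shell
# On the command line:
sbatch --qos=dcs-48hr myjob.sh

# Or as a directive inside the batch script:
#SBATCH --qos=dcs-48hr
```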
Example job submission scripts¶
See also: Modules for any additional options/requirements of specific MPI implementations. Typically, it is necessary to load the same modules at runtime (before calling srun) that were used when building a binary.
Simple (non-MPI)¶
A simple (non-MPI) job can be started by just calling srun:
#!/bin/bash -x
srun ./a.out   # replace ./a.out with your application
For example, the above job could be submitted to run 16 tasks on 1 node, in the partition "cluster", with the current working directory set to /foo/bar, email notification of the job's state turned on, a time limit of four hours (240 minutes), and STDOUT redirected to /foo/bar/baz.out as follows (where script.sh is the script):
sbatch -p cluster -N 1 -n 16 --mail-type=ALL --mail-user=example@rpi.edu -t 240 -D /foo/bar -o /foo/bar/baz.out ./script.sh
Note: In a simple, non-MPI case, running multiple tasks will create multiple instances of the same binary.
Interactive¶
Interactive jobs are supported. See the srun command manual page for details. Here is a usage example launching a shell on the compute node allocated to an interactive session:
salloc -t 30 --qos=interactive --gres=gpu:1 srun --pty bash -i
Or, alternatively, allocate and connect separately:
salloc -t 100 --gres=gpu:8
ssh "$SLURM_JOB_NODELIST"
OpenMPI¶
Open MPI & Slurm
Example job batch script slurmOpenMpi.sh:
IBM Spectrum MPI or Mellanox HPC-X¶
These implementations do not have direct Slurm support, so it is necessary to use mpirun. You must have passwordless SSH keys set up for mpirun to work. If mpirun outputs "ORTE was unable to reliably start one or more daemons.", you need to set up SSH keys.
Example job batch script slurmSpectrum.sh:
GPU-Direct¶
To enable GPU-Direct ('CUDA-aware MPI'), pass the -gpu and -mca pml_pami_enable_gdrcpy 1 flags to mpirun, e.g.:
mpirun -gpu -mca pml_pami_enable_gdrcpy 1 /path/to/your/executable
Job arrays¶
Job arrays provide a way to submit collections of similar jobs (see https://slurm.schedmd.com/job_array.html). Two additional environment variables are set for each array task: SLURM_ARRAY_JOB_ID and SLURM_ARRAY_TASK_ID.
#!/bin/bash
#SBATCH --job-name=array-example
#SBATCH --output=slurm-%A.%a.out
#SBATCH --error=slurm-%A.%a.err
#SBATCH --nodes=1
#SBATCH --ntasks=1 # total number of tasks across all nodes
#SBATCH --time=06:00:00
#SBATCH --array=0-6 # job array with index values 0, 1, 2, 3, 4, 5, 6
echo "SLURM_ARRAY_JOB_ID: $SLURM_ARRAY_JOB_ID."
echo "SLURM_ARRAY_TASK_ID: $SLURM_ARRAY_TASK_ID"
echo "Executing on the node:" $(hostname)
./my-executable <options>
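One common pattern is to use SLURM_ARRAY_TASK_ID to select a per-task input file. A minimal sketch (the input-file naming scheme is hypothetical; Slurm sets the variable inside a real array job, and it is defaulted below only so the snippet runs standalone):

```shell
#!/bin/bash
# SLURM_ARRAY_TASK_ID is set by Slurm inside an array job; default it
# here only so the sketch can run outside of Slurm.
SLURM_ARRAY_TASK_ID=${SLURM_ARRAY_TASK_ID:-3}

# Derive a per-task input file name (hypothetical naming scheme)
INPUT="input-file_${SLURM_ARRAY_TASK_ID}.txt"
echo "$INPUT"
# ./my-executable "$INPUT"
```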
Job arrays with MPI¶
This example uses MPI to run 36 tasks on each of 4 compute nodes, one node per array task, for 144 executions in total. It is written for DCS and uses spectrum-mpi; on NPL, substitute openmpi.
#!/bin/sh
#SBATCH --job-name=array_example # Job name
#SBATCH --nodes=1 # Use one node
#SBATCH --ntasks=36
#SBATCH --time=00:01:00 # Time limit hrs:min:sec
#SBATCH --gres=gpu:6
#SBATCH --output=array_%A-%a.out # Standard output and error log %A==slurm job id %a==slurm array task
#SBATCH --array=0-3 # Array range (we want to launch 144 total executions of 'hello')
# we want 144 total executions of the 'hello' binary, so we specify --array=0-3 (4 array tasks)
# because we allow a parallelism of 36 single-core executions per array task using --ntasks=36
#Include extra context in output for debugging
echo "SLURM_ARRAY_JOB_ID: $SLURM_ARRAY_JOB_ID."
echo "SLURM_ARRAY_TASK_ID: $SLURM_ARRAY_TASK_ID"
echo "Executing on the node:" $(hostname)
# using mpirun NOT srun as srun still has kinks for job steps at CCI
module load spectrum-mpi
# Calculate the starting and ending values for this task based
# on the SLURM task and the number of slurm tasks per array task
# NOTE: Slurm array task is not the same as a 'task', we have 4 total
# slurm array tasks and each of those is launching a 1 node job with 36 tasks
START_NUM=$(( $SLURM_ARRAY_TASK_ID * $SLURM_NTASKS + 1 ))
END_NUM=$(( ($SLURM_ARRAY_TASK_ID + 1) * $SLURM_NTASKS ))
for (( run=$START_NUM; run<=END_NUM; run++ )); do
mpirun -n 1 ./hello ${run} &
pids[${run}]=$!
done
# So we see here that we can pass an argument that is based off of the slurm array task ID
# this allows us to have (for example) 144 different input files
# like input and pass them like "input-file_${run}.txt"
# Technically it may be possible to just use a single wait with no arguments
# but this is more explicit to wait for each of the previously executed
# background jobs to complete.
for pid in ${pids[*]}; do
wait $pid
done
Many small tasks¶
For many small tasks running simultaneously or in quick succession, it is often better to submit one large job rather than many small jobs. On some systems, doing otherwise leads to resource fragmentation and poor scheduler performance. This is particularly true on DCS and NPL if your program cannot fully utilize a single GPU.
Example: This example requests 1 node with 1 GPU and runs 4 programs on that GPU at the same time.
sbatch.sh
#!/bin/sh
#SBATCH --job-name=test
#SBATCH -t 03:00:00
#SBATCH -N1
#SBATCH --gres=gpu:1
srun overload.sh &
wait
overload.sh
./my-executable <options> > testing1.log &
./my-executable <options> > testing2.log &
./my-executable <options> > testing3.log &
./my-executable <options> > testing4.log &
wait
Note the ampersand (&) at the end of each executable invocation and the wait command at the end. This will run all 4 jobs in parallel within the single allocation and wait until all 4 are complete.
Verifying allocated resources¶
Once you have requested a job via salloc or sbatch, you can confirm you are receiving the requested resources by running the scontrol show job <jobid> command. This will output information about requested and allocated resources.
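For example (12345 is a placeholder job ID; these commands require a live job to inspect):

```shell
# Full job record:
scontrol show job 12345

# Only the resource-related fields:
scontrol show job 12345 | grep -E 'NumNodes|NumCPUs|TRES'
```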