Slurm Job Scheduler
CCI uses Slurm to allocate resources for compute jobs. Slurm is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. A full description of the software can be found at https://slurm.schedmd.com/overview.html. We use cons_tres as our resource selection algorithm.
All compute jobs must be submitted to the Slurm queue and not run directly on the frontend node. Users running jobs on the frontend node will be locked out of the system.
- Slurm Documentation: https://slurm.schedmd.com/archive/slurm-23.11.0/
- Slurm FAQ in the official Slurm documentation: http://www.schedmd.com/slurmdocs/faq.html
Quick Reference¶
Commands¶
- squeue lists your jobs in the queue
- sinfo lists the state of all machines in the cluster
- sbatch submits batch jobs
- sprio lists the relative priorities of pending jobs in the queue and how they are calculated
- sacct display accounting and submission data for jobs
- scancel is used to cancel jobs
- salloc allocates a node for interactive use
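Typical invocations of these commands look like the following (the job ID and script name are placeholders):

```shell
sbatch myjob.sh        # submit a batch script; prints the new job ID
squeue -u $USER        # show only your own jobs in the queue
sinfo                  # show the state of nodes and partitions
sacct -j 123456        # accounting data for job 123456
scancel 123456         # cancel job 123456
```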
Notes¶
Exclusive Flag¶
- Do not use the --exclusive or --exclusive=1 parameters. They make non-obvious changes to your job behind the scenes that are equivalent to requesting all the resources on a node. If you request all the GPUs on a node, you already have exclusive access. If you need fewer resources but still want exclusivity, please use --exclusive=user instead; that allows other jobs from your own account to use the remaining idle GPUs.
Watching Slurm commands¶
When using watch to monitor the output of a Slurm command, please make sure to set an update interval of at least 60 seconds, e.g. watch -n 120 squeue. The output is unlikely to change any faster than this, and updating too frequently can cause undue load on the scheduler.
Resource specification¶
Options of interest (see the manual page for sbatch for a complete list):
-n, --ntasks=ntasks number of tasks to run (-n 1 is the default and is better left unspecified)
-N, --nodes=N number of nodes on which to run (N = min[-max])
--gres=<list> number of resources needed. In our case GPU or NVMe storage (--gres=gpu:1)
-c, --cpus-per-task=ncpus number of cpus required per task
--ntasks-per-node=n number of tasks to invoke on each node
--cpus-per-gpu=ncpus number of cpus required per allocated GPU
-i, --input=in file for batch script's standard input
-o, --output=out file for batch script's standard output
-e, --error=err file for batch script's standard error
-p, --partition=partition partition requested
-t, --time=minutes time limit you expect your job to complete within
-D, --chdir=directory set working directory for the batch script (named --workdir in older Slurm releases)
--mail-type=type notify on state change: BEGIN, END, FAIL or ALL
--mail-user=user who to send email notification for job state changes
Note that any of the above can be specified in a batch file by preceding the option with #SBATCH. All options defined this way must appear at the top of the batch file, with nothing separating them. For example, the following will send the job's output to a file called joboutput.<the job's ID>:
#SBATCH -o joboutput.%J
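For example, a complete batch-script header combining several of the options above might look like this (the partition name, time limit, and email address are illustrative):

```shell
#!/bin/bash
#SBATCH -p cluster               # partition to run in
#SBATCH -N 1                     # number of nodes
#SBATCH --ntasks-per-node=16     # tasks per node
#SBATCH -t 240                   # time limit in minutes
#SBATCH -o joboutput.%J          # STDOUT file; %J expands to the job ID
#SBATCH --mail-type=ALL          # email on state changes
#SBATCH --mail-user=example@rpi.edu

srun ./your_application          # placeholder binary name
```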
Available QOS's¶
Slurm allows a QOS to be requested to modify job limits and priorities. CCI has several QOSes that can be used by specifying --qos=<QOS name>.
QOS | Usage | Availability | Time Limit | Other Limits |
---|---|---|---|---|
interactive | Fast access to a single GPU for debugging | All users and clusters | 30 minutes | 1 job, 1 GPU |
dcs-48hr | Jobs that cannot use checkpointing to run within 6 hours | All users on DCS | 48 hours | 2 jobs, 36 nodes |
npl-48hr | Jobs that cannot use checkpointing to run within 6 hours | By request on NPL | 48 hours | 1 job, 4 nodes |
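A QOS can be requested on the command line or as a directive in the batch script itself; for example (the script name is a placeholder):

```shell
# On the command line:
sbatch --qos=dcs-48hr myjob.sh

# Or as a directive inside the batch script:
#SBATCH --qos=dcs-48hr
```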
Example job submission scripts¶
See also: Modules for any additional options/requirements of specific MPI implementations. Typically, it is necessary to load the same modules at runtime (before calling srun) that were used when building a binary.
Simple (non-MPI)¶
A simple (non-MPI) job can be started by just calling srun:
#!/bin/bash -x
srun ./a.out   # replace ./a.out with your application
For example, the above job could be submitted to run 16 tasks on 1 node, in the partition "cluster", with the current working directory set to /foo/bar, email notification of the job's state turned on, a time limit of four hours (240 minutes), and STDOUT redirected to /foo/bar/baz.out as follows (where script.sh is the script):
sbatch -p cluster -N 1 -n 16 --mail-type=ALL --mail-user=example@rpi.edu -t 240 -D /foo/bar -o /foo/bar/baz.out ./script.sh
Note: In a simple, non-MPI case, running multiple tasks will create multiple instances of the same binary.
Interactive¶
Interactive jobs are supported. See the srun command manual page for details. Here is a usage example launching a shell on the compute node allocated to an interactive session:
salloc -t 30 --qos=interactive --gres=gpu:1 srun --pty bash -i
Or, alternatively, allocate and connect separately:
salloc -t 100 --gres=gpu:8
ssh "$SLURM_JOB_NODELIST"
OpenMPI¶
Open MPI & Slurm
Example job batch script slurmOpenMpi.sh:
IBM Spectrum MPI or Mellanox HPC-X¶
These implementations do not have direct Slurm support, so it is necessary to use mpirun. You must have passwordless SSH keys set up for mpirun to work. If mpirun outputs "ORTE was unable to reliably start one or more daemons.", you need to set up SSH keys.
Example job batch script slurmSpectrum.sh:
GPU-Direct¶
To enable GPU-Direct ('CUDA-aware MPI'), pass the -gpu and -mca pml_pami_enable_gdrcpy 1 flags to mpirun, e.g.:
mpirun -gpu -mca pml_pami_enable_gdrcpy 1 /path/to/your/executable
Job arrays¶
Job arrays provide a way to submit collections of similar jobs (see https://slurm.schedmd.com/job_array.html). Two additional environment variables are set for each array task: SLURM_ARRAY_JOB_ID and SLURM_ARRAY_TASK_ID.
#!/bin/bash
#SBATCH --job-name=array-example
#SBATCH --output=slurm-%A.%a.out
#SBATCH --error=slurm-%A.%a.err
#SBATCH --nodes=1
#SBATCH --ntasks=1 # total number of tasks across all nodes
#SBATCH --time=06:00:00
#SBATCH --array=0-6 # job array with index values 0, 1, 2, 3, 4, 5, 6
echo "SLURM_ARRAY_JOB_ID: $SLURM_ARRAY_JOB_ID."
echo "SLURM_ARRAY_TASK_ID: $SLURM_ARRAY_TASK_ID"
echo "Executing on the node:" $(hostname)
./my-executable <options>
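One common pattern is to use SLURM_ARRAY_TASK_ID to select a per-task input file. A minimal sketch (the input-file naming scheme is hypothetical; Slurm sets the variable inside a real array job, and it is defaulted below only so the snippet runs standalone):

```shell
#!/bin/bash
# SLURM_ARRAY_TASK_ID is set by Slurm inside an array job; default it
# here only so the sketch can run outside of Slurm.
SLURM_ARRAY_TASK_ID=${SLURM_ARRAY_TASK_ID:-3}

# Derive a per-task input file name (hypothetical naming scheme)
INPUT="input-file_${SLURM_ARRAY_TASK_ID}.txt"
echo "$INPUT"
# ./my-executable "$INPUT"
```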
Job arrays with MPI¶
This example uses MPI to run 36 tasks on each of 4 compute nodes, one node per array task, for 144 executions in total. It is written for DCS and uses spectrum-mpi; on NPL, substitute openmpi.
#!/bin/sh
#SBATCH --job-name=array_example # Job name
#SBATCH --nodes=1 # Use one node
#SBATCH --ntasks=36
#SBATCH --time=00:01:00 # Time limit hrs:min:sec
#SBATCH --gres=gpu:6
#SBATCH --output=array_%A-%a.out # Standard output and error log %A==slurm job id %a==slurm array task
#SBATCH --array=0-3 # Array range (we want to launch 144 total executions of 'hello')
# we want 144 total executions of the 'hello' binary, so we specify --array=0-3 (4 array tasks)
# because we allow a parallelism of 36 single-core executions per array task using --ntasks=36
#Include extra context in output for debugging
echo "SLURM_ARRAY_JOB_ID: $SLURM_ARRAY_JOB_ID."
echo "SLURM_ARRAY_TASK_ID: $SLURM_ARRAY_TASK_ID"
echo "Executing on the node:" $(hostname)
# using mpirun NOT srun as srun still has kinks for job steps at CCI
module load spectrum-mpi
# Calculate the starting and ending values for this task based
# on the SLURM task and the number of slurm tasks per array task
# NOTE: Slurm array task is not the same as a 'task', we have 4 total
# slurm array tasks and each of those is launching a 1 node job with 36 tasks
START_NUM=$(( $SLURM_ARRAY_TASK_ID * $SLURM_NTASKS + 1 ))
END_NUM=$(( ($SLURM_ARRAY_TASK_ID + 1) * $SLURM_NTASKS ))
for (( run=$START_NUM; run<=END_NUM; run++ )); do
mpirun -n 1 ./hello ${run} &
pids[${run}]=$!
done
# So we see here that we can pass an argument that is based off of the slurm array task ID
# this allows us to have (for example) 144 different input files
# like input and pass them like "input-file_${run}.txt"
# Technically it may be possible to just use a single wait with no arguments
# but this is more explicit to wait for each of the previously executed
# background jobs to complete.
for pid in ${pids[*]}; do
wait $pid
done
Many small tasks¶
For many small tasks running simultaneously or in quick succession, it is often better to submit one large job rather than many small jobs. On some systems, doing otherwise leads to resource fragmentation and poor scheduler performance. This is particularly true on DCS and NPL if your program cannot fully utilize a single GPU.
Example: This example requests 1 node with 1 GPU and runs 4 programs on that GPU at the same time.
sbatch.sh
#!/bin/sh
#SBATCH --job-name=test
#SBATCH -t 03:00:00
#SBATCH -N1
#SBATCH --gres=gpu:1
srun overload.sh &
wait
overload.sh
./my-executable <options> > testing1.log &
./my-executable <options> > testing2.log &
./my-executable <options> > testing3.log &
./my-executable <options> > testing4.log &
wait
Note the ampersand (&) at the end of each executable invocation and the wait command at the end. This will run all 4 jobs in parallel within the single allocation and wait until all 4 are complete.
Verifying allocated resources¶
Once you have requested a job via salloc or sbatch, you can confirm you are receiving the requested resources by running the scontrol show job <jobid> command. This will output information about requested and allocated resources.
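For example (12345 is a placeholder job ID; these commands require a live job to inspect):

```shell
# Full job record:
scontrol show job 12345

# Only the resource-related fields:
scontrol show job 12345 | grep -E 'NumNodes|NumCPUs|TRES'
```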