DCS Cluster
Accessing AiMOS (DCS) Cluster¶
Front-end nodes: dcsfen01 and dcsfen02
All compute jobs must be submitted to the Slurm queue and must not be run directly on the front-end nodes. Users running jobs on a front-end node will be locked out of the system.
Note: dcsfen02 is available to run debug jobs as well as acting as a front-end node. It is prioritized over other nodes to limit resource fragmentation and improve system utilization. This has implications for performance and resource availability on the node, particularly GPUs. For example, if a debug job has requested GPUs in exclusive mode, they may not be available to run other code.
System information¶
System contains 268 nodes, each with:
- 2x IBM POWER9 processors clocked at 3.15 GHz. Each processor contains 20 cores, each with 4 hardware threads (160 logical processors per node).
- 512 GiB RAM
- 1.6 TB Samsung NVMe Flash
Within the total cluster, 16 nodes each contain:
- 4x NVIDIA Tesla V100 GPUs with 16 GiB of memory each
Within the total cluster, 252 nodes each contain:
- 6x NVIDIA Tesla V100 GPUs with 32 GiB of memory each
Four nodes each contain a Nallatech FPGA.
All nodes are connected with EDR InfiniBand and mount the unified file system.
Performance¶
A single NVIDIA Volta V100 GPU can theoretically perform 7.5 TeraFLOPs in double precision and achieves 855 GB/s on the STREAM Triad benchmark. These figures are from the NVIDIA Tesla V100 data sheet and a 2017 Hot Chips presentation.
Combined, the two Power9 CPUs on an AiMOS node can perform 1 TeraFLOPs in double precision.
2 sockets * 20 cores/socket * 8 flops/cycle/core * 3.15 giga-cycles/second ~= 1 TeraFLOPs double precision
The flops/cycle/core is from Dirk Pleiter's SC18 tutorial "IBM POWER9 Processor, NVIDIA V100 GPU and IBM AC922 Node Hardware Architecture": https://indico-jsc.fz-juelich.de/event/84/session/4/contribution/8/material/slides/0.pdf
Note: to measure the performance of your application, it is strongly suggested that you disable node sharing with other jobs by passing --gres=gpu:6 (or --gres=gpu:16g:4 on the nodes with four GPUs) to Slurm.
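For example, an interactive benchmarking allocation on a 6-GPU node might look like the following (a minimal sketch; the node count and time limit are illustrative):
salloc -N 1 --gres=gpu:6 -t 30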
Building software¶
Many packages for building software are available as modules. However, some tools are available without loading any modules and a subset of those packages can be overridden by modules. Please pay careful attention to which modules you have loaded.
Build systems/tools:
- ninja 1.7.2
- cmake 3.13.4 (cmake3)
- autoconf 2.69
- automake 1.13.4
Compilers:
- gcc 4.8.5
- clang/llvm 3.4.2
Currently the following are available as modules:
- automake 1.16.1
- bazel 0.17.2, 0.18.0, 0.18.1, 0.21.0
- ccache 3.5
- cmake 3.12.2, 3.12.3
- gcc 6.4.0, 6.5.0, 7.4.0, 8.1.0, 8.2.0
- xl/xl_r (xlC and xlf) 13.1.6, 16.1.0
- MPICH 3.2.1 (mpich module, built with XL compiler)
- CUDA 9.1, 10.0
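For example, to build with a newer toolchain one might load several of the modules above. The exact module names and version strings below are assumptions; check module avail for what is currently installed:
module avail
module load gcc/8.2.0 cmake/3.12.3 cuda/10.0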
CUDA¶
When mixing CUDA and MPI, please make sure an xl module is loaded and that nvcc is called with -ccbin $(CXX); otherwise linking will fail.
CUDA code should be compiled with -arch=sm_70 for the Volta V100 GPUs.
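A sketch of a mixed CUDA/MPI build, assuming the xl and mpich modules are loaded, that the MPICH wrapper mpicxx invokes the XL C++ compiler, and that the cuda module sets CUDA_HOME (source and object file names are illustrative):
# compile the device code for Volta, using the MPI/XL wrapper as the host compiler
nvcc -arch=sm_70 -ccbin mpicxx -c kernel.cu -o kernel.o
# link with the MPI wrapper and the CUDA runtime
mpicxx -o app main.o kernel.o -L${CUDA_HOME}/lib64 -lcudart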
XL MPICH Compiler Wrapper Flags¶
The default MPICH compiler wrapper flags -O3 -qipa -qhot will perform aggressive optimizations that could alter the semantics of your program. If the compiler applies such an optimization, the following warning message will be displayed:
1500-036: (I) The NOSTRICT option (default at OPT(3)) has the potential to alter the semantics of a program. Please refer to documentation on the STRICT/NOSTRICT option for more information.
Specifying flags on the command line will override these defaults. For example, the following flags will respectively reduce the optimization level, add debug symbols, and block semantics-changing optimizations: -O2 -g -qstrict.
More information on the XL compiler options can be found in the IBM XL documentation (see the Documentation section below).
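For example, an invocation of the MPICH C wrapper that overrides the defaults might look like the following (the source file name is illustrative):
mpicc -O2 -g -qstrict -c solver.c -o solver.o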
Modules¶
The following modules are installed via Spack environments, are considered experimental, and are subject to change.
GCC 8.4.1 + Spectrum MPI: recommended for C++ that cannot be built with XL
module use /gpfs/u/software/dcs-rhel8-spack-install/v0162gccSpectrum/lmod/linux-rhel8-ppc64le/Core/
module load spectrum-mpi/10.4-2ycgnlq
module load cmake/3.20.0/1 cuda/11.1-3vj4c72
export OMPI_CXX=g++
export OMPI_CC=gcc
export OMPI_FC=gfortran
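After loading these modules and exporting the wrapper variables, the Spectrum MPI compiler wrappers should invoke GCC. A quick optional sanity check (a sketch):
which mpicxx
mpicxx --version   # should report g++ 8.4.1 rather than xlC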
Submitting jobs¶
Jobs are submitted via Slurm. Users accustomed to the concept of partitions will find them on this system, but should not specify one, with the exception of debug (and only for applicable jobs). Slurm will automatically select the best available partition at runtime; forcing a particular partition will delay a job's start time or could prevent it from running at all.
The debug partition is limited to single-node jobs running up to 30 minutes, and may only use a maximum of 128 GB of memory.
See the Spectrum MPI section of the Slurm page for an example job script.
Each DCS job must request at least one GPU or the submission will fail.
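A minimal sketch of a batch script that satisfies these requirements (the GPU count, task count, time limit, and application path are illustrative; see the Slurm page for a complete Spectrum MPI example):
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --gres=gpu:2
#SBATCH --time=00:30:00
mpirun /path/to/your/application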
Using GPUs¶
When submitting GPU/CUDA jobs via Slurm, users must pass --gres=gpu:# to specify the number of GPUs desired per node. If GPUs are not requested with a job, they will not be accessible.
For RPI users only: There are GPUs with both 16 GB and 32 GB memory sizes available. If you require one or more GPUs with more than 16 GB, then you must specify it along with the number of GPUs, e.g. --gres=gpu:32g:#. Do not explicitly request these GPUs unless you actually need them.
Note: The system currently forces node sharing to improve GPU utilization. Please make sure you specify the resources you need as part of your job (memory, GPUs, CPU cores, etc.), not just the number of nodes and/or tasks. If an oversubscribed node causes an issue, please contact support.
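For example (the GPU counts and the job.sh script name are placeholders):
# request any two available GPUs per node
sbatch --gres=gpu:2 job.sh
# RPI users: explicitly request two 32 GB GPUs per node
sbatch --gres=gpu:32g:2 job.sh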
Using GPUs in exclusive mode¶
Slurm will set each GPU in an allocation to the CUDA "exclusive process" mode when the "cuda-mode-exclusive" feature/constraint is requested, e.g. salloc --gres=gpu:2 -C cuda-mode-exclusive. For applications using one process per GPU this mode may be used as a safeguard to ensure that GPUs are not oversubscribed. This mode is also recommended when running MPS.
Note: This is not to be confused with exclusive user access to the
GPU. Only one user may access a GPU regardless of the mode.
GPU-Direct¶
Spectrum MPI disables GPU-Direct by default. See the Slurm page for the syntax to enable GPU-Direct ('CUDA aware MPI').
Setting GPU-process Affinity¶
Use a CUDA runtime API call, such as cudaSetDevice, to set process-to-device affinity.
Setting Environment Variables¶
Running jobs with Spectrum MPI 10.3 may require setting environment variables in the parent shell of each MPI process. For example, to set the environment variables cake and bar you can add the following to your Slurm run script:
export cake=42
export bar=21
mpirun -x cake -x bar <other arguments>
With OpenMPI newer than 1.10, -x should be replaced by --mca mca_base_env_list.
https://www.open-mpi.org/doc/v4.0/man1/mpirun.1.php#sect19
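A sketch of the newer form, assuming the semicolon-separated list syntax of the mca_base_env_list parameter:
mpirun --mca mca_base_env_list "cake=42;bar=21" <other arguments>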
Using NVMe storage¶
To use the NVMe storage in a node, request it along with the job specification: --gres=nvme (this can be combined with other requests, such as GPUs). When the first job step starts, the system will initialize the storage and create the path /mnt/nvme/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}.
The storage is not persistent between allocations. However, it may be used and shared by multiple job steps within an allocation; see Slurm job arrays.
CCI is exploring other configurations in which the NVMe storage can be operated. Please email any suggestions to support.
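A sketch of a job that requests the NVMe storage alongside a GPU and stages data through it; the combined --gres list, time limit, and file paths are illustrative:
#!/bin/bash
#SBATCH --gres=gpu:1,nvme
#SBATCH --time=00:30:00
# the directory below is created by the system once the job's first step starts
SCRATCH=/mnt/nvme/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}
srun /bin/bash -c "cp /path/to/input.dat $SCRATCH/ && /path/to/your/application $SCRATCH/input.dat"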
Profiling¶
One method for profiling is reading the time base registers (mftb, mftbu). An example of this is found in the FFTW cycle header.
The time base frequency for the POWER9 processor is 512,000,000 ticks per second (512 MHz).
Debugging¶
To write core files when a job fails, when using Spectrum MPI, modify your mpirun command as follows:
mpirun /bin/bash -c "ulimit -S -c unlimited &> /dev/null; /path/to/your/application"
When your application crashes, it will produce core.#### files, one for each failing process. For a single-process hello world executable the core file was 11 MB. If a large MPI job fails, several GB of core files could be written to disk; be careful.
To see the location of the failure, run the following command:
gdb --core=core.#### /path/to/binary/that/crashed
Compiling with debug symbols and lower optimization will increase the amount of information in the core file that can be accessed with gdb.
Killing a hung process and writing core files¶
Add the ulimit command shown above to your Slurm job script and submit your job. Once it is hung, run the following command to signal its termination and generate core dump files:
scancel --signal=ABRT <job id>
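Putting this together, a sketch (the job ID 123456 is a placeholder for the hung job's ID as reported by squeue):
# in the job script:
mpirun /bin/bash -c "ulimit -S -c unlimited &> /dev/null; /path/to/your/application"
# later, from a front-end shell, once the job appears hung:
scancel --signal=ABRT 123456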
Documentation¶
IBM Spectrum MPI¶
IBM XL¶
CUDA and GPU programming¶
See also¶