.. _afw_hpc11_user_guide:
##################################
Air Force Weather HPC11 User Guide
##################################
.. _system-overview:
***************
System Overview
***************
The Air Force Weather (AFW) HPC11 resource is a pair of semi-redundant/autonomous machines. The two
machines will be referred to as "Hall A" and "Hall B" in this document. These machines are accessed
via several login nodes which provide users with a place for administrative tasks such as
editing/compiling code and submitting/monitoring batch jobs. The machines are produced by Cray
(now part of Hewlett Packard Enterprise) and are part of the Cray/HPE EX series.
.. _compute-nodes:
Compute Nodes
=============
CPU-Only Nodes
--------------
Each hall in HPC11 contains 812 compute nodes, each of which has dual 64-core AMD EPYC processors.
Each of the 64 physical cores can support two hardware threads and thus can also appear as two
virtual cores. 804 of the nodes are configured with 256GB of memory while the remaining 8 are
configured with 1TB.
.. note::
Each of the two processors on a CPU-only compute node has 4 NUMA domains, and each NUMA domain
has 4 L3 cache regions. The hardware threads associated with each L3 cache & NUMA domain are
shown in the node image below.
.. figure:: /images/HPC11_CPU_Node_Diagram.png
:align: center
:alt: Simplified diagram of HPC11 CPU node
HPC11 CPU Node Diagram
CPU+GPU Nodes
-------------
Each hall also contains 32 CPU+GPU nodes. Each of these nodes contains a single 64-core AMD EPYC
processor, 4 NVIDIA A100 GPUs, and 256GB of memory. On 10 of the nodes, each GPU has 40GB of HBM2
high-bandwidth memory and on the remaining 22, each GPU has 80GB of HBM2.
.. note::
Each CPU+GPU compute node contains a single 64-core AMD EPYC processor. Processors on CPU+GPU
compute nodes are slightly different than those on the CPU-only compute nodes in that they have 4
NUMA domains with 2 L3 cache regions per NUMA domain. A GPU is associated with each NUMA domain.
The hardware threads associated with each L3 cache/NUMA domain as well as the GPU associated with
each NUMA domain are shown in the image below.
.. figure:: /images/HPC11_GPU_Node_Diagram.png
:align: center
:alt: Simplified diagram of HPC11 GPU node
HPC11 CPU+GPU Node Diagram
All compute nodes mount two high-performance Lustre parallel filesystems and an NFS filesystem that
provides read-only access to user home directories and shared project directories. Unless a hall is
down for maintenance or some other reason, users will not see a distinction and will be able to
treat HPC11 as two interchangeable 800+ node computers. All nodes within a hall are connected via a
fast 100Gb Cray Slingshot interconnect.
.. note::
See the HPC11-specific documentation delivered to AFLCMC for information on submitting to the
different nodes.
.. _login-nodes:
Login Nodes
===========
HPC11 has multiple login nodes that are automatically rotated via round-robin DNS. The login nodes
contain the same processor configuration as the compute nodes, but have 1TB of memory. These nodes
are intended for tasks such as editing/compiling code and managing jobs on the compute nodes. They
are a shared resource used by all HPC11 users, and as such any CPU- or memory-intensive tasks on
these nodes could interrupt service to other users. As a courtesy, we ask that you refrain from
doing any significant analysis or visualization tasks on the login nodes. Login nodes are accessed
via ssh.
.. note::
For more information on connecting to the HPC11 resources, see the HPC11-specific
documentation delivered to AFLCMC.
.. _file-systems:
File Systems
============
The OLCF manages multiple filesystems for HPC11. These filesystems provide both high-performance
scratch areas for running jobs as well as user and project home directory space.
There are two independent high-performance parallel Lustre scratch filesystems. Compute nodes in
both halls mount both filesystems. Users may elect to store data on either filesystem (or both).
Both of these filesystems are considered "scratch" in the sense that no data is automatically backed
up or archived. All projects are assigned one of the two AFW Lustre scratch filesystems as a default
or primary filesystem, but will have matching directories created in the secondary filesystem as
well. Only the top-level project directories are automatically created. Users may elect to replicate
directory structures and copy data between the filesystems (and are encouraged to do so), but there
is no automated process to sync the two areas. The Lustre filesystems are intended as the
high-performance filesystems for use by running jobs.
In addition to the Lustre filesystem, each user is granted a home directory that is mounted via NFS.
Similarly, projects are provided with an NFS-mounted shared project directory. These filesystems
are backed up, but space in these areas is limited via quotas. Additionally, these directories are
mounted read-only on compute nodes so running jobs will not be able to write data to them.
.. note::
For more information on the filesystems and how to use them, see the HPC11-specific
documentation delivered to AFLCMC.
**********************************
Shell and programming environments
**********************************
HPC11 provides users with many software packages and scientific libraries installed at the
system level. These software packages are managed via Environment Modules, which automatically make
the necessary changes to a user's environment to facilitate use of the software. This section
discusses using modules on HPC11.
Default Shell
=============
A user's default shell is selected when completing the user account request form. The chosen shell
is set across all OLCF-managed resources. Currently, supported shells include:
- bash
- tcsh
- csh
- ksh
- zsh
If you would like to have your default shell changed, please contact the OLCF User Assistance
Center at afw-help@olcf.ornl.gov.
Environment Management with Modules
===================================
The HPC11 user environment is typically modified dynamically using *modules* (specifically, the
Environment Modules software package). These modules aim to make software usage easier by
automatically altering a user's environment to set environment variables such as ``PATH`` and
``LD_LIBRARY_PATH`` appropriately. Thus, users need not worry about modifying those variables
directly; they simply need to load the desired module.
The Cray Environment modules allow you to alter the software available in your shell environment
with significantly less risk of creating package and version combinations that cannot coexist in a
single environment.
General Usage
-------------
The interface to the module system is provided by the ``module`` command:

+------------------------------------+-----------------------------------------------------------------------+
| Command                            | Description                                                           |
+====================================+=======================================================================+
| ``module -t list``                 | Shows a terse list of the currently loaded modules.                   |
+------------------------------------+-----------------------------------------------------------------------+
| ``module avail``                   | Shows a table of the currently available modules.                     |
+------------------------------------+-----------------------------------------------------------------------+
| ``module help <modulefile>``       | Shows help information about ``<modulefile>``.                        |
+------------------------------------+-----------------------------------------------------------------------+
| ``module show <modulefile>``       | Shows the environment changes made by ``<modulefile>``.               |
+------------------------------------+-----------------------------------------------------------------------+
| ``module load <modulefile> [...]`` | Loads the given modulefile(s) into the current environment.           |
+------------------------------------+-----------------------------------------------------------------------+
| ``module use <path>``              | Adds ``<path>`` to the modulefile search path (``MODULEPATH``).       |
+------------------------------------+-----------------------------------------------------------------------+
| ``module unuse <path>``            | Removes ``<path>`` from the modulefile search path (``MODULEPATH``).  |
+------------------------------------+-----------------------------------------------------------------------+
| ``module purge``                   | Unloads all modules.                                                  |
+------------------------------------+-----------------------------------------------------------------------+
| ``module refresh``                 | Unloads and reloads all currently loaded modulefiles.                 |
+------------------------------------+-----------------------------------------------------------------------+
| ``module swap <old> <new>``        | Swaps module ``<old>`` for ``<new>`` (frequently used for changing    |
|                                    | compilers).                                                           |
+------------------------------------+-----------------------------------------------------------------------+
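For example, a typical sequence for finding, inspecting, and loading a module (shown here with the
``cray-netcdf`` module discussed elsewhere in this guide) might look like the following sketch:

.. code::

   $ module avail cray-netcdf
   $ module show cray-netcdf
   $ module load cray-netcdf
   $ module -t list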
Cray-Specific Modules
---------------------
Many of the modules on the HPC11 machine are provided by Cray. These modules will be prefixed with
"cray-" in the module name. Generally, loading these modules will add their libraries, include
paths, etc. to the Cray compiler wrapper environment so users do not need to add specific include or
library paths to compile applications.
Installed Software
------------------
The OLCF provides some pre-installed software packages and scientific libraries within the system
software environment for AF use. Additionally, the Cray programming environment includes many common
libraries (e.g., netCDF, HDF5). OLCF also provides an extensive Anaconda-based Python distribution with
additional AFW-specific packages via the "afw-python" series of modules. AF users who find a
general-purpose software package to be missing can request it through the HPC11 AFLCMC program
office. AF user software applications, to include software libraries and mission-specific packages,
are a user responsibility.
Compiling
=========
Compiling on HPC11 is similar to compiling on commodity clusters, but Cray provides compiler
wrappers via their Programming Environment modules that make it much easier to build codes with
commonly used packages (e.g. MPI, netCDF, HDF5, etc.) by automatically including the necessary
compiler/linker flags for those packages (based on the modules that are currently loaded in the
user's environment). The packages that are automatically included are typically those whose names
are prefixed with "cray-" (for example, cray-netcdf).
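For example, with the appropriate "cray-" module loaded, a hypothetical netCDF-based Fortran code
could be built without any explicit include or library flags (file names here are illustrative):

.. code::

   $ module load cray-netcdf
   $ ftn -o model.x model.f90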
Available Compilers
-------------------
The following compiler suites are available on HPC11:

- Cray Compiling Environment (CCE)
- GNU Compiler Collection (GCC)
- NVIDIA HPC SDK
Upon login, default versions of the Cray compiler and Cray's message passing interface (MPI)
libraries are automatically added to each user's environment.
Changing Compilers
------------------
Changing to a Different Compiler Suite
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
When changing to a different compiler suite (i.e. from Cray to GNU or vice versa), it's important to
make sure the correct environment is set up for the new compiler. This includes changing relevant
modules for MPI and other software. To aid users in pairing the correct compiler and environment,
the module system on HPC11 provides "Programming Environment" modules that pull in support and
scientific libraries specific to a compiler. Thus, when changing compilers it is important to do so
via the PrgEnv-[compiler] module rather than the individual module specific to the compiler. For
example, to change the default environment from the Cray compiler to GCC, you would use the
following command:
.. code::
$ module swap PrgEnv-cray PrgEnv-gnu
This will automatically unload the current compiler and system libraries associated with it, load
the new compiler environment, and load associated system libraries (e.g. MPI) as well.
Changing Versions Within a Compiler Suite
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To use a specific *version* of a given compiler, you must first ensure the compiler's programming
environment is loaded, and *then* swap to the correct compiler version. For example, to change from
the default Cray programming environment to the GNU environment, and then to change to a non-default
version of the gcc compiler (in this example, version 9.2.0), you would use:
.. code::
$ module swap PrgEnv-cray PrgEnv-gnu
$ module swap gcc gcc/9.2.0
.. note::
We recommend that users avoid "module purge" when using programming environment modules; rather,
use the default module environment at the time of login and modify it as needed.
Compiler Wrappers
-----------------
The HPC11 Programming Environment provides wrapper scripts for the compiler families and system
libraries:
- ``cc`` invokes the C compiler
- ``CC`` invokes the C++ compiler
- ``ftn`` invokes the Fortran compiler
The wrapper script is independent of the back-end compiler (Cray, GNU, or NVIDIA) being used. Thus,
there isn't a need to remember different names for the C/C++/Fortran compilers (which can vary from
vendor to vendor). The ``cc``, ``CC``, and ``ftn`` commands/wrapper scripts will always be available
and will call the appropriate vendor's compiler. Additionally, the wrappers automatically pass the
required include and library paths to add things like MPI, netCDF, HDF5, etc., provided the
corresponding "cray-" modules (e.g. cray-netcdf) are also loaded.
Compiling MPI Codes
-------------------
The compiler wrappers discussed in the previous section automatically link in MPI libraries. Thus,
it is very simple to compile codes with MPI support:
- C: ``$ cc -o my_mpi_program.x my_mpi_program.c``
- C++: ``$ CC -o my_mpi_program.x my_mpi_program.cxx``
- Fortran: ``$ ftn -o my_mpi_program.x my_mpi_program.f90``
Compiling OpenMP Threaded Codes
-------------------------------
OpenMP support is disabled by default, so you must add a flag to the compile line to enable it
within your executable. The flag differs slightly between different compilers as shown below.
+------------------+----------+---------------+-----------------------------------------------------------+
| Programming | Language | Flag | Example(s) |
| Environment | | | |
+==================+==========+===============+===========================================================+
| ``PrgEnv-cray``  | C        | ``-fopenmp``  | ``$ cc -fopenmp -o my_omp_program.x my_omp_program.c``    |
|                  +----------+               +-----------------------------------------------------------+
|                  | C++      |               | ``$ CC -fopenmp -o my_omp_program.x my_omp_program.cxx``  |
|                  +----------+---------------+-----------------------------------------------------------+
|                  | Fortran  | ``-homp`` or  | ``$ ftn -fopenmp -o my_omp_program.x my_omp_program.f90`` |
|                  |          | ``-fopenmp``  |                                                           |
+------------------+----------+---------------+-----------------------------------------------------------+
| ``PrgEnv-nvhpc`` | C        | ``-mp``       | ``$ cc -mp -o my_omp_program.x my_omp_program.c``         |
|                  +----------+               +-----------------------------------------------------------+
|                  | C++      |               | ``$ CC -mp -o my_omp_program.x my_omp_program.cxx``       |
|                  +----------+               +-----------------------------------------------------------+
|                  | Fortran  |               | ``$ ftn -mp -o my_omp_program.x my_omp_program.f90``      |
+------------------+----------+---------------+-----------------------------------------------------------+
| ``PrgEnv-gnu``   | C        | ``-fopenmp``  | ``$ cc -fopenmp -o my_omp_program.x my_omp_program.c``    |
|                  +----------+               +-----------------------------------------------------------+
|                  | C++      |               | ``$ CC -fopenmp -o my_omp_program.x my_omp_program.cxx``  |
|                  +----------+               +-----------------------------------------------------------+
|                  | Fortran  |               | ``$ ftn -fopenmp -o my_omp_program.x my_omp_program.f90`` |
+------------------+----------+---------------+-----------------------------------------------------------+
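As a sketch of the full workflow, the following compiles a hypothetical threaded code under
``PrgEnv-gnu`` and runs it with 4 OpenMP threads per task (``OMP_NUM_THREADS`` sets the thread
count, and the ``srun -c`` option reserves CPUs for each task; file names are illustrative):

.. code::

   $ cc -fopenmp -o my_omp_program.x my_omp_program.c
   $ export OMP_NUM_THREADS=4
   $ srun -N1 -n1 -c4 ./my_omp_program.x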
For more information on *running threaded codes*, please see the :ref:`thread-layout` subsection of
the :ref:`hpc11-running-jobs` section in this user guide.
.. note::
A special case of OpenMP is OpenMP Offloading, which is a directive-based approach to using GPUs
(sometimes called "accelerators") in your code. For information on offloading, see the section
below.
Compiling GPU-Enabled Codes
---------------------------
There are several ways to build codes for the A100 GPUs. These include using the CUDA programming
language, as well as directive based approaches like OpenMP Offloading and OpenACC. When working
with GPU technology it's common to see references to *host* code and *device* code. *Host* code is
code that is intended to run on the CPU, while *device* code is code to run on the GPU (which is
also sometimes generically referred to as an *accelerator*).
.. note::
The software necessary for compiling GPU-enabled codes is only available on the GPU nodes. You
will need to start an interactive job targeting the GPU partition to access the modules that
allow you to build GPU codes. For more information on targeting the GPU partition, see the
HPC11-specific documentation provided to AFLCMC.
CUDA
^^^^
CUDA is a programming language that allows you to write code that will run on GPUs by creating
specific subprograms, called kernels, that contain GPU code. Several tutorials for using CUDA are
available in the Oak Ridge Leadership Computing Facility's Training Archive, and a basic
introduction to CUDA is available on NVIDIA's website.
CUDA files typically have a ``.cu`` suffix. If you use this naming scheme, the system will recognize
the file as needing CUDA compilation and will automatically call the correct back-end compilers.
With the compiler wrappers, it is possible to have a single source file that mixes MPI code with
CUDA code, and to compile it with a single command.
As noted above, you must first start an interactive job on the GPU partition. Once there, you
must load the needed modules to build CUDA codes:
.. code::
module load cudatoolkit craype-accel-nvidia80
In general, the compiler wrappers will link in the needed CUDA libraries automatically; however, you
will need to add a link flag for the NVIDIA Management Library. So, a sample CUDA compilation might
be:
.. code::
cc -o my_program.x my_program.cu -lnvidia-ml
You can then run this program via ``srun`` as described in that section of the documentation.
Linking CUDA-Enabled Libraries
""""""""""""""""""""""""""""""
When building with ``PrgEnv-nvhpc``, the ``--cudalib`` flag can be used to tell the compiler to link
certain CUDA-enabled libraries. This flag accepts a comma-separated list of libraries to add to the
link line. For example, to link in CUDA-enabled BLAS and FFT libraries, you would use
``--cudalib=cublas,cufft``.
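For example, a hypothetical C++ code calling cuBLAS could be built under ``PrgEnv-nvhpc`` as
follows (file names are illustrative):

.. code::

   $ CC --cudalib=cublas -o my_solver.x my_solver.cpp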
The Cray and GNU compilers do not support that flag, so you will need to link any needed
CUDA-enabled libraries in the usual manner (with the ``-l`` option on the compile/link line).
CUDA-Aware MPI
""""""""""""""
A special case of CUDA is CUDA-Aware MPI. With CUDA-Aware MPI, users can use device buffers directly
in MPI commands. This alleviates the need to transfer buffers between device and host before and
after the relevant MPI call. The MPI call is all that is necessary. To enable CUDA-Aware MPI, set
the environment variable ``MPICH_GPU_SUPPORT_ENABLED`` to ``1`` in your batch job prior to the srun
command. For ksh/bash/zsh, use ``export MPICH_GPU_SUPPORT_ENABLED=1``; for csh/tcsh, use
``setenv MPICH_GPU_SUPPORT_ENABLED 1``.
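For example, the relevant portion of a bash batch script might look like this sketch (the
executable name is illustrative):

.. code::

   export MPICH_GPU_SUPPORT_ENABLED=1
   srun -n 4 ./my_cuda_aware_program.x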
OpenMP Offloading
^^^^^^^^^^^^^^^^^
OpenMP Offloading is a directive-based approach to using GPUs/accelerators. Rather than creating
specific subroutines to use the GPUs as is the case with CUDA, with OpenMP offloading you insert
directives in your code that instruct the compiler to create certain parts of your code as "device"
code.
To use OpenMP Offloading, you must have the ``craype-accel-nvidia80`` module loaded when you compile
and run. Additionally, you must provide an appropriate flag to the compiler to enable OpenMP. For
``PrgEnv-gnu`` and ``PrgEnv-cray``, this is the same flag described above to enable OpenMP in
general (``-fopenmp`` or ``-homp``). For ``PrgEnv-nvhpc``, you must specify ``-mp=gpu``.
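For example, a hypothetical offloading code could be compiled under ``PrgEnv-nvhpc`` as follows
(file names are illustrative):

.. code::

   $ module load craype-accel-nvidia80
   $ cc -mp=gpu -o my_offload.x my_offload.c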
Compiling Hybrid Codes
----------------------
It's common for codes to have a mix of programming models, for example MPI along with OpenMP. When
compiling these codes, you can simply combine the options shown above for each programming model.
For example, if you have a code that combines MPI, OpenMP threading, and CUDA, compiling with the
Cray compilers could be as simple as:
.. code::
cc -o my_program.x -fopenmp my_program.cu -lnvidia-ml
.. _hpc11-running-jobs:
************
Running Jobs
************
In High Performance Computing (HPC), computational work is performed by *jobs*. Individual jobs
produce data that lend relevant insight into grand challenges in science and engineering. As such,
the timely, efficient execution of jobs is the primary concern in the operation of any HPC system.
Jobs on HPC11 typically comprise a few different components:
- A batch submission script.
- A binary executable.
- A set of input files for the executable.
- A set of output files created by the executable.
And the process for running a job, in general, is to:
#. Prepare executables and input files.
#. Write a batch script.
#. Submit the batch script to the batch scheduler.
#. Optionally monitor the job before and during execution.
The following sections describe in detail how to create, submit, and manage jobs
for execution on HPC11.
Login vs Compute Nodes
======================
As described in the :ref:`system-overview`, HPC11 consists of both login and compute nodes. When you
initially log into the HPC11 machine, you are placed on a *login* node. Login node resources are
shared by all users of the system. Users should be mindful of this when running tasks on the login
nodes. Login nodes should be used for basic tasks such as file editing, code compilation, data
backup, and job submission. Login nodes should *not* be used for memory- or compute-intensive tasks.
Users should also limit the number of simultaneous tasks performed on the login resources. For
example, a user should not run 10 simultaneous ``tar`` processes on a login node or specify a large
number to the ``-j`` parallel make option.
.. note::
Users should not use ``make -j`` without supplying an argument to the ``-j`` option. If ``-j``
is specified without an argument, make will launch a number of tasks equal to the number of
cores on the login node. This will adversely affect all users on the node.
Compute nodes are the appropriate location for large, long-running,
computationally-intensive jobs. Compute nodes are requested via the Slurm batch
scheduler, as described below.
.. warning::
Compute-intensive, memory-intensive, or otherwise disruptive processes running on login nodes
may be killed without warning.
Slurm
=====
The HPC11 resources use the Slurm batch
scheduler. Previously, the HPC10 resource used the LSF scheduler. While there are similarities
between different scheduling systems, the commands differ. The table below provides a
comparison of useful/typical commands for each scheduler.
+------------------------------------+---------------------+---------------+
| Task | LSF (HPC10) | Slurm (HPC11) |
+====================================+=====================+===============+
| View batch queue | ``bjobs`` | ``squeue`` |
+------------------------------------+---------------------+---------------+
| Submit batch script | ``bsub`` | ``sbatch`` |
+------------------------------------+---------------------+---------------+
| Submit interactive batch job | ``bsub -Is $SHELL`` | ``salloc`` |
+------------------------------------+---------------------+---------------+
| Run parallel code within batch job | ``mpirun`` | ``srun`` |
+------------------------------------+---------------------+---------------+
| Abort a queued or running job | ``bkill`` | ``scancel`` |
+------------------------------------+---------------------+---------------+
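For example, a sketch of starting an interactive job with ``salloc`` and running a code within it
(the project name, node count, and walltime are illustrative) might look like:

.. code::

   $ salloc -A ABC123 -N 1 -t 30:00
   $ srun -n 4 ./a.out
   $ exit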
Node Exclusivity
----------------
The scheduler on HPC11 uses a non-exclusive node policy by default. This
means that, resources permitting, the system is free to place multiple jobs
per node; however, the system will not place jobs from multiple users on
any node. Nodes will only be shared among jobs from a single user. In practice,
this scheduling policy permits more efficient use of system resources by
giving the scheduler the ability to "pack" several small jobs from a given
user on a single node instead of requiring each to run on a separate node.
There are several caveats to non-exclusive node assignment. By default, the
system will allocate 2GB of memory per core. This can be modified with the
``--mem-per-cpu`` flag; however, there is a maximum of 4GB/core. If you use
a ``--mem-per-cpu`` flag larger than that, the system will allocate an
additional core for each additional 4GB memory block (or fraction thereof)
that you request. For example, if you request ``--mem-per-cpu=10G``, the
system will allocate 3 cores even if you've only requested 1.
Should you want exclusive node assignment, you need to specify the
``--exclusive`` Slurm option either on your ``sbatch``/``salloc`` command
line or within your Slurm batch script. Additionally, you should request
``--mem=0`` to guarantee that the system makes all memory on each node
available to your job.
Writing Batch Scripts
---------------------
Batch scripts, or job submission scripts, are the mechanism by which a user
configures and submits a job for execution. A batch script is simply a shell
script that also includes commands to be interpreted by the batch scheduling
software (e.g. Slurm).
Batch scripts are submitted to the batch scheduler, where they are then parsed
for the scheduling configuration options. The batch scheduler then places the
script in the appropriate queue, where it is designated as a batch job. Once the
batch job makes its way through the queue, the script will be executed on the
primary compute node of the allocated resources.
Batch scripts are submitted for execution using the ``sbatch`` command.
For example, the following will submit the batch script named ``test.slurm``:
.. code::
sbatch test.slurm
If successfully submitted, a Slurm job ID will be returned. This ID can be used
to track the job. It is also helpful in troubleshooting a failed job; make a
note of the job ID for each of your jobs in case you must contact the OLCF User
Assistance Center (afw-help@olcf.ornl.gov) for support.
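A typical submit-and-monitor sequence might look like the following sketch (the job ID shown is
illustrative):

.. code::

   $ sbatch test.slurm
   Submitted batch job 123456
   $ squeue -j 123456
   $ scancel 123456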
Components of a Batch Script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Batch scripts are parsed into the following three sections:
* | **Interpreter Line**
| The first line of a script can be used to specify the script’s interpreter; this line is
optional. If not used, the submitter’s default shell will be used. The line uses the
*hash-bang* syntax, i.e., ``#!/path/to/shell``.
* | **Slurm Submission Options**
| The Slurm submission options are preceded by the string ``#SBATCH``, making them appear as
comments to a shell. Slurm will look for ``#SBATCH`` options in a batch script from the script’s
first line through the first non-comment, non-whitespace line. (A comment line begins with ``#``.)
``#SBATCH`` options entered after the first non-comment line will not be read by Slurm.
* | **Shell Commands**
| The shell commands follow the last ``#SBATCH`` option and represent the executable content of
the batch job. If any ``#SBATCH`` lines follow executable statements, they will be treated as
comments only.
|
| The execution section of a script will be interpreted by a shell and can contain multiple
lines of executables, shell commands, and comments. When the job's queue wait time is finished,
commands within this section will be executed on the primary compute node of the job's allocated
resources. Under normal circumstances, the batch job will exit the queue after the last line of
the script is executed.
Example Batch Scripts
^^^^^^^^^^^^^^^^^^^^^
Using Non-Exclusive Nodes (default)
"""""""""""""""""""""""""""""""""""
.. code-block:: bash
:linenos:
#!/bin/bash
#SBATCH -A ABC123
#SBATCH -J test
#SBATCH -N 1
#SBATCH -t 1:00:00
#SBATCH --mem-per-cpu=3G
#SBATCH --cluster-constraint=green

cd $SLURM_SUBMIT_DIR
date
srun -N1 -n4 ./a.out
This batch script shows examples of the three sections outlined above:
+-------------------------------------------------------------------------------------------------------------------+
| **Interpreter Line** |
+----+------------------------------------------+-------------------------------------------------------------------+
| 1 | ``#!/bin/bash`` | This line is optional. It is used to specify a shell to interpret |
| | | the script. In this example, the Bash shell will be used. |
+----+------------------------------------------+-------------------------------------------------------------------+
| **Slurm Options** |
+----+------------------------------------------+-------------------------------------------------------------------+
| 2 | ``#SBATCH -A ABC123`` | The job will be charged to the “ABC123” project. |
+----+------------------------------------------+-------------------------------------------------------------------+
| 3 | ``#SBATCH -J test`` | The job will be named test. |
+----+------------------------------------------+-------------------------------------------------------------------+
| 4 | ``#SBATCH -N 1`` | The job will request 1 node. |
+----+------------------------------------------+-------------------------------------------------------------------+
| 5 | ``#SBATCH -t 1:00:00`` | The job will request a walltime of 1 hour. |
+----+------------------------------------------+-------------------------------------------------------------------+
| 6 | ``#SBATCH --mem-per-cpu=3G`` | Each core will be allocated 3GB of memory. |
+----+------------------------------------------+-------------------------------------------------------------------+
| 7 | ``#SBATCH --cluster-constraint=green`` | The job will run on a cluster with the 'green' label |
+----+------------------------------------------+-------------------------------------------------------------------+
| **Shell Commands** |
+----+------------------------------------------+-------------------------------------------------------------------+
| 8 | | This line is left blank, so it will be ignored. |
+----+------------------------------------------+-------------------------------------------------------------------+
| 9 | ``cd $SLURM_SUBMIT_DIR`` | This command will change the current directory to the directory |
| | | from which the script was submitted. |
+----+------------------------------------------+-------------------------------------------------------------------+
| 10 | ``date`` | This command will run the date command. |
+----+------------------------------------------+-------------------------------------------------------------------+
| 11 | ``srun -N1 -n4 ./a.out`` | This command will run 4 MPI instances of the executable a.out on |
|    |                                          | the compute node allocated by the batch system.                   |
+----+------------------------------------------+-------------------------------------------------------------------+
Using Exclusive Nodes
"""""""""""""""""""""
.. code-block:: bash
:linenos:
#!/bin/bash
#SBATCH -A ABC123
#SBATCH -J test
#SBATCH -N 2
#SBATCH -t 1:00:00
#SBATCH --exclusive
#SBATCH --mem=0
#SBATCH --cluster-constraint=green

cd $SLURM_SUBMIT_DIR
date
srun -N2 -n16 ./a.out
This batch script shows examples of the three sections outlined above:
+-------------------------------------------------------------------------------------------------------------------+
| **Interpreter Line** |
+----+------------------------------------------+-------------------------------------------------------------------+
| 1 | ``#!/bin/bash`` | This line is optional. It is used to specify a shell to interpret |
| | | the script. In this example, the Bash shell will be used. |
+----+------------------------------------------+-------------------------------------------------------------------+
| **Slurm Options** |
+----+------------------------------------------+-------------------------------------------------------------------+
| 2 | ``#SBATCH -A ABC123`` | The job will be charged to the “ABC123” project. |
+----+------------------------------------------+-------------------------------------------------------------------+
| 3 | ``#SBATCH -J test`` | The job will be named test. |
+----+------------------------------------------+-------------------------------------------------------------------+
| 4 | ``#SBATCH -N 2`` | The job will request 2 nodes. |
+----+------------------------------------------+-------------------------------------------------------------------+
| 5 | ``#SBATCH -t 1:00:00`` | The job will request a walltime of 1 hour. |
+----+------------------------------------------+-------------------------------------------------------------------+
| 6  | ``#SBATCH --exclusive``                  | The job will run in exclusive mode (no other jobs will be placed  |
|    |                                          | on any nodes allocated to this job).                              |
+----+------------------------------------------+-------------------------------------------------------------------+
| 7  | ``#SBATCH --mem=0``                      | All memory on the node will be made available to the job.         |
+----+------------------------------------------+-------------------------------------------------------------------+
| 8  | ``#SBATCH --cluster-constraint=green``   | The job will run on a cluster with the 'green' label.             |
+----+------------------------------------------+-------------------------------------------------------------------+
| **Shell Commands** |
+----+------------------------------------------+-------------------------------------------------------------------+
| 9 | | This line is left blank, so it will be ignored. |
+----+------------------------------------------+-------------------------------------------------------------------+
| 10 | ``cd $SLURM_SUBMIT_DIR`` | This command will change the current directory to the directory |
| | | from which the script was submitted. |
+----+------------------------------------------+-------------------------------------------------------------------+
| 11 | ``date`` | This command will run the date command. |
+----+------------------------------------------+-------------------------------------------------------------------+
| 12 | ``srun -N2 -n 16 ./a.out`` | This command will run 16 MPI instances of the executable a.out, |
| | | spread out across the 2 nodes allocated by the system. |
| | | By default this will be 8 tasks per node. |
+----+------------------------------------------+-------------------------------------------------------------------+
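The default distribution shown above can also be requested explicitly rather than relying on Slurm's even spread of tasks across nodes. As a sketch using the standard Slurm ``--ntasks-per-node`` option (not an HPC11-specific flag), the same 8-tasks-per-node layout could be written as:

.. code-block:: bash

   # Launch 16 tasks across 2 nodes, explicitly placing 8 tasks on each node
   # instead of relying on Slurm's default even distribution.
   srun -N2 --ntasks-per-node=8 ./a.out

Making the per-node task count explicit can make a script's intent clearer and guards against surprises if the node count is later changed.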
Interactive Batch Jobs
^^^^^^^^^^^^^^^^^^^^^^
Batch scripts are useful when one has a pre-determined group of commands to execute, the results of
which can be viewed at a later time. However, it is often necessary to run tasks on compute
resources interactively.
Users are not allowed to access compute nodes directly from a login node. Instead, users must use an
*interactive batch job* to allocate and gain access to compute resources. This is done by using the
Slurm ``salloc`` command. The ``salloc`` command accepts many of the same arguments that would be
provided in a batch script. For example, the following command requests an interactive allocation on
a cluster with the "green" label (``--cluster-constraint=green``) to be charged to project ABC123
(``-A ABC123``), using 4 nodes (``-N 4``) in exclusive mode (``--exclusive``), with all memory on the
nodes made available to the job (``--mem=0``), and with a maximum walltime of 1 hour (``-t 1:00:00``):
.. code::
$ salloc -A ABC123 -N 4 --exclusive --mem=0 -t 1:00:00 --cluster-constraint=green
While ``salloc`` does provide interactive access, it does not necessarily do so immediately. The
job must still wait for resources to become available in the same way as any other batch job. Once
resources become available, the job will start and the user will be given an interactive prompt on
the primary compute node within the set of nodes allocated to the job. Commands may then be
executed directly on the command line (rather than through a batch script). To run a parallel
application across the set of allocated compute nodes, use the ``srun`` command just as you would in
a batch script.
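For example, a short interactive session might look like the following sketch (the job ID shown is illustrative and will differ in practice):

.. code-block:: bash

   # Request the allocation; the prompt returns once nodes are granted.
   $ salloc -A ABC123 -N 2 --exclusive --mem=0 -t 1:00:00 --cluster-constraint=green
   salloc: Granted job allocation 1234
   # Launch a parallel program from the interactive prompt.
   $ srun -N2 -n16 ./a.out
   # Exit the shell to release the allocation.
   $ exit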
Debugging
"""""""""
A common use of interactive batch jobs is to aid debugging. Interactive access to compute resources
allows a process to be run to the point of failure, modified, and restarted immediately without
giving up the allocated nodes, which speeds up the debugging effort considerably.
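A typical cycle inside an interactive allocation might look like the following sketch (the source file name and compiler invocation are illustrative, not HPC11-specific):

.. code-block:: bash

   $ srun -N1 -n4 ./a.out     # run to the point of failure
   $ vi solver.c              # make a quick change (illustrative file name)
   $ cc -o a.out solver.c     # recompile without leaving the allocation
   $ srun -N1 -n4 ./a.out     # rerun immediately; no new wait in the queue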
Choosing a Job Size
"""""""""""""""""""
Because interactive jobs must sit in the queue until enough resources become
available to allocate, it is useful to know when a job can start.
Use the ``sbatch --test-only`` command to see when a job of a specific size
could be scheduled. For example, the snapshot below shows that a 2 node job
in exclusive mode would start at 10:54.
.. code::
$ sbatch --test-only --cluster-constraint=green -N2 --exclusive --mem=0 -t1:00:00 batch-script.slurm
sbatch: Job 1375 to start at 2019-08-06T10:54:01 using 512 processors on nodes node[0100,0121] in partition batch
.. note::
The queue is fluid, thus the given time is an estimate made from the current queue state and load.
Future job submissions and job completions will alter the estimate.
.. _common-batch-options:
Common Batch Options
^^^^^^^^^^^^^^^^^^^^
The following table summarizes frequently-used options for ``sbatch`` and ``salloc``. When using
``salloc``, options are specified on the command line. When using ``sbatch``, options may be
specified on the command line or in the batch script file. If they're in the batch script file, they
must be preceded with ``#SBATCH`` as described above.
See the ``salloc`` and ``sbatch`` man pages for a complete description of each option as well as the
other options available.
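When the same option appears both on the command line and in the batch script, the command-line value takes precedence. For example, the following submission would override any ``#SBATCH -t`` line inside the script (the script name is illustrative):

.. code-block:: bash

   # The -t given here wins over a '#SBATCH -t' directive in batch-script.slurm.
   $ sbatch -t 2:00:00 batch-script.slurm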
+--------------------------+------------------------------------------+------------------------------------------------------------+
| Option | Use | Description |
+==========================+==========================================+============================================================+
| ``-A``                   | ``#SBATCH -A ABC123``                    | Specify the account/project to which the job should be     |
| | | charged. The account string, e.g. ``ABC123`` is typically |
| | | composed of three letters followed by three digits and |
| | | optionally followed by a subproject identifier. You can |
| | | view your assigned projects via the myOLCF User Portal. |
+--------------------------+------------------------------------------+------------------------------------------------------------+
| ``-N``                   | ``#SBATCH -N 2``                         | Number of compute nodes to allocate. Jobs will be          |
| | | allocated 'whole' (i.e. dedicated/non-shared) nodes unless |
| | | running in a "shared" partition. See the HPC11-specific |
| | | documentation provided to AFLCMC for more information on |
| | | HPC11 partitions. |
+--------------------------+------------------------------------------+------------------------------------------------------------+
| ``-t`` | ``#SBATCH -t