.. _afw_hpc11_user_guide:

##################################
Air Force Weather HPC11 User Guide
##################################

.. _system-overview:

***************
System Overview
***************

The Air Force Weather (AFW) HPC11 resource is a pair of semi-redundant/autonomous machines. The two machines will be referred to as "Hall A" and "Hall B" in this document. These machines are accessed via several login nodes which provide users with a place for administrative tasks such as editing/compiling code and submitting/monitoring batch jobs. The machines are produced by Cray (now part of Hewlett Packard Enterprise) and are part of Cray/HPE's EX series.

.. _compute-nodes:

Compute Nodes
=============

CPU-Only Nodes
--------------

Each hall in HPC11 contains 812 compute nodes, each of which has dual 64-core AMD EPYC processors. Each of the 64 physical cores can support two hardware threads and thus can also appear as two virtual cores. 804 of the nodes are configured with 256GB of memory while the remaining 8 are configured with 1TB.

.. note::
   Each of the two processors on a CPU-only compute node has 4 NUMA domains, and each NUMA domain has 4 L3 cache regions. The hardware threads associated with each L3 cache & NUMA domain are shown in the node image below.

.. figure:: /images/HPC11_CPU_Node_Diagram.png
   :align: center
   :alt: Simplified diagram of HPC11 CPU node

   HPC11 CPU Node Diagram

CPU+GPU Nodes
-------------

Each hall also contains 32 CPU+GPU nodes. Each of these nodes contains a single 64-core AMD EPYC processor, 4 NVIDIA A100 GPUs, and 256GB of memory. On 10 of the nodes, each GPU has 40GB of HBM2 high-bandwidth memory; on the remaining 22, each GPU has 80GB of HBM2.

.. note::
   Each CPU+GPU compute node contains a single 64-core AMD EPYC processor. Processors on CPU+GPU compute nodes are slightly different from those on the CPU-only compute nodes in that they have 4 NUMA domains with 2 L3 cache regions per NUMA domain. A GPU is associated with each NUMA domain. The hardware threads associated with each L3 cache/NUMA domain, as well as the GPU associated with each NUMA domain, are shown in the image below.

.. figure:: /images/HPC11_GPU_Node_Diagram.png
   :align: center
   :alt: Simplified diagram of HPC11 GPU node

   HPC11 CPU+GPU Node Diagram

All compute nodes mount two high-performance Lustre parallel filesystems and an NFS filesystem which provides read-only access to user home directories and shared project directories. Unless a hall is down for maintenance or some other reason, users will not see a distinction between the halls and can treat HPC11 as a pair of 800+ node computers. All nodes within a hall are connected via a fast 100Gb Cray Slingshot interconnect.

.. note::
   See the HPC11-specific documentation delivered to AFLCMC for information on submitting to the different nodes.

.. _login-nodes:

Login Nodes
===========

HPC11 has multiple login nodes that are automatically rotated via round-robin DNS. The login nodes contain the same processor configuration as the compute nodes, but have 1TB of memory. These nodes are intended for tasks such as editing/compiling code and managing jobs on the compute nodes. They are a shared resource used by all HPC11 users, and as such any CPU- or memory-intensive tasks on these nodes could interrupt service to other users. As a courtesy, we ask that you refrain from doing any significant analysis or visualization tasks on the login nodes. Login nodes are accessed via ssh.
.. note::
   For more information on connecting to the HPC11 resources, see the HPC11-specific documentation delivered to AFLCMC.

.. _file-systems:

File Systems
============

The OLCF manages multiple filesystems for HPC11. These filesystems provide high-performance scratch areas for running jobs as well as user and project home directory space.

There are two independent high-performance parallel Lustre scratch filesystems. Compute nodes in both halls mount both filesystems. Users may elect to store data on either filesystem (or both). Both of these filesystems are considered "scratch" in the sense that no data is automatically backed up or archived. All projects are assigned one of the two AFW Lustre scratch filesystems as a default or primary filesystem but will have matching directories created in the secondary filesystem as well. Only the top-level project directories are automatically created. Users may elect to replicate directory structures and copy data between the filesystems (and are encouraged to do so), but there is no automated process to sync the two areas. The Lustre filesystems are intended as high-performance filesystems for use by running jobs.

In addition to the Lustre filesystems, each user is granted a home directory that is mounted via NFS. Similarly, projects are provided with an NFS-mounted shared project directory. These filesystems are backed up, but space in these areas is limited via quotas. Additionally, these directories are mounted read-only on compute nodes, so running jobs will not be able to write data to them.

.. note::
   For more information on the filesystems and how to use them, see the HPC11-specific documentation delivered to AFLCMC.

**********************************
Shell and programming environments
**********************************

HPC11 provides users with many software packages and scientific libraries installed at the system level. These software packages are managed via Environment Modules, which automatically make the necessary changes to a user's environment to facilitate use of the software. This section discusses using modules on HPC11.

Default Shell
=============

A user's default shell is selected when completing the user account request form. The chosen shell is set across all OLCF-managed resources. Currently, supported shells include:

- bash
- tcsh
- csh
- ksh
- zsh

If you would like to have your default shell changed, please contact the OLCF User Assistance Center at afw-help@olcf.ornl.gov.

Environment Management with Modules
===================================

The HPC11 user environment is typically modified dynamically using *modules* (specifically, the Environment Modules software package). These modules aim to make software usage easier by automatically altering a user's environment to set environment variables such as ``PATH`` and ``LD_LIBRARY_PATH`` appropriately. Thus, users need not worry about modifying those variables directly; they simply need to load the desired module. The Cray Environment Modules allow you to alter the software available in your shell environment with significantly less risk of creating package and version combinations that cannot coexist in a single environment.
General Usage
-------------

The interface to the module system is provided by the ``module`` command:

.. list-table::
   :header-rows: 1

   * - Command
     - Description
   * - ``module -t list``
     - Shows a terse list of the currently loaded modules.
   * - ``module avail``
     - Shows a table of the currently available modules
   * - ``module help <modulefile>``
     - Shows help information about <modulefile>
   * - ``module show <modulefile>``
     - Shows the environment changes made by the modulefile <modulefile>
   * - ``module load <modulefile> [...]``
     - Loads the given <modulefile>(s) into the current environment
   * - ``module use <path>``
     - Adds <path> to the modulefile search cache and ``MODULEPATH``
   * - ``module unuse <path>``
     - Removes <path> from the modulefile search cache and ``MODULEPATH``
   * - ``module purge``
     - Unloads all modules
   * - ``module refresh``
     - Unloads then reloads all currently loaded modulefiles
   * - ``module swap <module1> <module2>``
     - Swaps <module1> for <module2> (frequently used for changing compilers)

Cray-Specific Modules
---------------------

Many of the modules on the HPC11 machine are provided by Cray. These modules are prefixed with "cray-" in the module name. Generally, loading these modules will add their libraries, include paths, etc. to the Cray compiler wrapper environment so users do not need to add specific include or library paths to compile applications.

Installed Software
------------------

The OLCF provides some pre-installed software packages and scientific libraries within the system software environment for AF use. Additionally, the Cray programming environment includes many common libraries (e.g. netCDF, HDF5, etc.). OLCF also provides an extensive Python Anaconda package with additional AFW-specific packages via the "afw-python" series of modules. AF users who find a general-purpose software package to be missing can request it through the HPC11 AFLCMC program office. AF user software applications, to include software libraries and mission-specific packages, are a user responsibility.
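Putting these commands together, a typical workflow for locating and loading one of the pre-installed packages might look like the following sketch (``cray-netcdf`` and ``afw-python`` are used only as examples; run ``module avail`` to see what is actually installed):

.. code::

   $ module -t list                # show what is currently loaded
   $ module avail cray-netcdf      # list the available cray-netcdf versions
   $ module load cray-netcdf       # load the default version
   $ module show afw-python        # inspect the changes a module would make before loading it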
Compiling
=========

Compiling on HPC11 is similar to compiling on commodity clusters, but Cray provides compiler wrappers via their Programming Environment modules that make it much easier to build codes with commonly used packages (e.g. MPI, netCDF, HDF5, etc.) by automatically including the necessary compiler/linker flags for those packages (based on the modules that are currently loaded in the user's environment). The packages that are automatically included are typically those whose names are prefixed with "cray-" (for example, cray-netcdf).

Available Compilers
-------------------

The following compiler suites are available on HPC11:

- Cray Compiling Environment
- GNU Compiler Collection
- NVIDIA HPC SDK

Upon login, default versions of the Cray compiler and Cray's message passing interface (MPI) libraries are automatically added to each user's environment.

Changing Compilers
------------------

Changing to a Different Compiler Suite
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When changing to a different compiler suite (e.g. from Cray to GNU or vice versa), it's important to make sure the correct environment is set up for the new compiler. This includes changing relevant modules for MPI and other software. To aid users in pairing the correct compiler and environment, the module system on HPC11 provides "Programming Environment" modules that pull in support and scientific libraries specific to a compiler. Thus, when changing compilers it is important to do so via the PrgEnv-[compiler] module rather than the individual module specific to the compiler. For example, to change the default environment from the Cray compiler to GCC, you would use the following command:

.. code::

   $ module swap PrgEnv-cray PrgEnv-gnu

This will automatically unload the current compiler and the system libraries associated with it, load the new compiler environment, and load associated system libraries (e.g. MPI) as well.

Changing Versions Within a Compiler Suite
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To use a specific *version* of a given compiler, you must first ensure the compiler's programming environment is loaded, and *then* swap to the correct compiler version. For example, to change from the default Cray programming environment to the GNU environment, and then to change to a non-default version of the gcc compiler (in this example, version 9.2.0), you would use:

.. code::

   $ module swap PrgEnv-cray PrgEnv-gnu
   $ module swap gcc gcc/9.2.0

.. note::
   We recommend that users avoid ``module purge`` when using programming environment modules; rather, use the default module environment at the time of login and modify it as needed.

Compiler Wrappers
-----------------

The HPC11 Programming Environment provides wrapper scripts for the compiler families and system libraries:

- ``cc`` invokes the C compiler
- ``CC`` invokes the C++ compiler
- ``ftn`` invokes the Fortran compiler

The wrapper scripts are independent of the back-end compiler (Cray or GNU) that is being used. Thus, there isn't a need to remember different names for the C/C++/Fortran compilers (which can vary from vendor to vendor). The ``cc``, ``CC``, and ``ftn`` commands/wrapper scripts will always be available and will call the appropriate vendor's compiler. Additionally, the wrappers automatically pass the required include and library paths to add things like MPI, netCDF, HDF5, etc., provided the corresponding "cray-" modules (e.g. cray-netcdf) are also loaded.
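For example, a minimal sketch of building a code that uses netCDF (the source file name is hypothetical) needs no explicit include or library paths once the corresponding module is loaded:

.. code::

   $ module load cray-netcdf
   $ ftn -o read_data.x read_data.f90   # netCDF paths and libraries are added by the ftn wrapper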
Compiling MPI Codes
-------------------

The compiler wrappers discussed in the previous section automatically link in MPI libraries. Thus, it is very simple to compile codes with MPI support:

- C: ``$ cc -o my_mpi_program.x my_mpi_program.c``
- C++: ``$ CC -o my_mpi_program.x my_mpi_program.cxx``
- Fortran: ``$ ftn -o my_mpi_program.x my_mpi_program.f90``

Compiling OpenMP Threaded Codes
-------------------------------

OpenMP support is disabled by default, so you must add a flag to the compile line to enable it within your executable. The flag differs slightly between compilers as shown below.

.. list-table::
   :header-rows: 1

   * - Programming Environment
     - Language
     - Flag
     - Example(s)
   * - ``PrgEnv-cray``
     - C
     - ``-fopenmp``
     - ``$ cc -fopenmp -o my_omp_program.x my_omp_program.c``
   * - ``PrgEnv-cray``
     - C++
     - ``-fopenmp``
     - ``$ CC -fopenmp -o my_omp_program.x my_omp_program.cxx``
   * - ``PrgEnv-cray``
     - Fortran
     - ``-homp`` or ``-fopenmp``
     - ``$ ftn -fopenmp -o my_omp_program.x my_omp_program.f90``
   * - ``PrgEnv-nvhpc``
     - C
     - ``-mp``
     - ``$ cc -mp -o my_omp_program.x my_omp_program.c``
   * - ``PrgEnv-nvhpc``
     - C++
     - ``-mp``
     - ``$ CC -mp -o my_omp_program.x my_omp_program.cxx``
   * - ``PrgEnv-nvhpc``
     - Fortran
     - ``-mp``
     - ``$ ftn -mp -o my_omp_program.x my_omp_program.f90``
   * - ``PrgEnv-gnu``
     - C
     - ``-fopenmp``
     - ``$ cc -fopenmp -o my_omp_program.x my_omp_program.c``
   * - ``PrgEnv-gnu``
     - C++
     - ``-fopenmp``
     - ``$ CC -fopenmp -o my_omp_program.x my_omp_program.cxx``
   * - ``PrgEnv-gnu``
     - Fortran
     - ``-fopenmp``
     - ``$ ftn -fopenmp -o my_omp_program.x my_omp_program.f90``

For more information on *running threaded codes*, please see the :ref:`thread-layout` subsection of the :ref:`hpc11-running-jobs` section in this user guide.

.. note::
   A special case of OpenMP is OpenMP Offloading, which is a directive-based approach to using GPUs (sometimes called "accelerators") in your code. For information on offloading, see the section below.

Compiling GPU-Enabled Codes
---------------------------

There are several ways to build codes for the A100 GPUs. These include using the CUDA programming language, as well as directive-based approaches like OpenMP Offloading and OpenACC. When working with GPU technology it's common to see references to *host* code and *device* code. *Host* code is code that is intended to run on the CPU, while *device* code is code to run on the GPU (which is also sometimes generically referred to as an *accelerator*).

.. note::
   The software necessary for compiling GPU-enabled codes is only available on the GPU nodes. You will need to start an interactive job targeting the GPU partition to access the modules that allow you to build GPU codes. For more information on targeting the GPU partition, see the HPC11-specific documentation provided to AFLCMC.
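For example, a minimal sketch of such an interactive job might look like the following; the option used to reach the GPU nodes is site-specific and is shown here only as a placeholder (see the HPC11-specific documentation for the actual partition/constraint to use):

.. code::

   $ salloc -A ABC123 -N 1 -t 1:00:00 -p <gpu_partition_name>
   $ module load cudatoolkit craype-accel-nvidia80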
CUDA
^^^^

CUDA is a programming language that allows you to write code that will run on GPUs by creating specific subprograms, called kernels, that contain GPU code. Several tutorials for using CUDA are available on the Oak Ridge Leadership Computing Facility's Training Archive, and a basic introduction to CUDA is available on NVIDIA's website.

CUDA files typically have a ``.cu`` suffix. If you use this naming scheme, the system will recognize the file as needing CUDA compilation and will automatically call the correct back-end compilers. With the compiler wrappers, it is possible to have a single source file that mixes MPI code with CUDA code, and compile it with a single command. As noted above, you must first start an interactive job on the GPU partition. Once there, you must load the modules needed to build CUDA codes:

.. code::

   module load cudatoolkit craype-accel-nvidia80

In general, the compiler wrappers will link in the needed CUDA libraries automatically; however, you will need to add a link flag for the NVIDIA Management Library. So, a sample CUDA compilation might be:

.. code::

   cc -o my_program.x my_program.cu -lnvidia-ml

You can then run this program via ``srun`` as described in that section of the documentation.

Linking CUDA-Enabled Libraries
""""""""""""""""""""""""""""""

When building with ``PrgEnv-nvhpc``, the ``--cudalib`` flag can be used to tell the compiler to link certain CUDA-enabled libraries. This flag accepts a comma-separated list of libraries to add to the link line. For example, to link in CUDA-enabled BLAS and FFT libraries, you would use ``--cudalib=cublas,cufft``. The Cray and GNU compilers do not support that flag, so you will need to link any needed CUDA-enabled libraries in the usual manner (with the ``-l`` option on the compile/link line).

CUDA-Aware MPI
""""""""""""""

A special case of CUDA is CUDA-Aware MPI. With CUDA-Aware MPI, users can use device buffers directly in MPI commands. This alleviates the need to transfer buffers between device and host before and after the relevant MPI call; the MPI call is all that is necessary. To enable CUDA-Aware MPI, set the environment variable ``MPICH_GPU_SUPPORT_ENABLED`` to ``1`` in your batch job prior to the ``srun`` command. For ksh/bash/zsh, use ``export MPICH_GPU_SUPPORT_ENABLED=1``; for csh/tcsh, use ``setenv MPICH_GPU_SUPPORT_ENABLED 1``.

OpenMP Offloading
^^^^^^^^^^^^^^^^^

OpenMP Offloading is a directive-based approach to using GPUs/accelerators. Rather than creating specific subroutines to use the GPUs, as is the case with CUDA, with OpenMP Offloading you insert directives in your code that instruct the compiler to create certain parts of your code as "device" code. To use OpenMP Offloading, you must have the ``craype-accel-nvidia80`` module loaded when you compile and run. Additionally, you must provide an appropriate flag to the compiler to enable OpenMP. For ``PrgEnv-gnu`` and ``PrgEnv-cray``, this is the same flag described above to enable OpenMP in general (``-fopenmp`` or ``-homp``). For ``PrgEnv-nvhpc``, you must specify ``-mp=gpu``.

Compiling Hybrid Codes
----------------------

It's common for codes to have a mix of programming models, for example MPI along with OpenMP. When compiling these codes, you can simply combine the options shown above for each programming model. For example, if you have a code that combines MPI, OpenMP threading, and CUDA, compiling with the Cray compilers could be as simple as:

.. code::

   cc -o my_program.x -fopenmp my_program.cu -lnvidia-ml
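A matching sketch of running such a hybrid executable from within a batch job is shown below; the node, task, and thread counts are arbitrary examples, and thread placement details are covered in the :ref:`thread-layout` subsection of the :ref:`hpc11-running-jobs` section:

.. code-block:: bash

   export OMP_NUM_THREADS=8              # OpenMP threads per MPI task
   export MPICH_GPU_SUPPORT_ENABLED=1    # only needed if the code uses CUDA-Aware MPI
   srun -N1 -n 4 --cpus-per-task=8 ./my_program.x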
.. _hpc11-running-jobs:

************
Running Jobs
************

In High Performance Computing (HPC), computational work is performed by *jobs*. Individual jobs produce data that lend relevant insight into grand challenges in science and engineering. As such, the timely, efficient execution of jobs is the primary concern in the operation of any HPC system.

Jobs on HPC11 typically comprise a few different components:

- A batch submission script.
- A binary executable.
- A set of input files for the executable.
- A set of output files created by the executable.

And the process for running a job, in general, is to:

#. Prepare executables and input files.
#. Write a batch script.
#. Submit the batch script to the batch scheduler.
#. Optionally monitor the job before and during execution.

The following sections describe in detail how to create, submit, and manage jobs for execution on HPC11.

Login vs Compute Nodes
======================

As described in the :ref:`system-overview`, HPC11 consists of both login and compute nodes. When you initially log into the HPC11 machine, you are placed on a *login* node. Login node resources are shared by all users of the system. Users should be mindful of this when running tasks on the login nodes. Login nodes should be used for basic tasks such as file editing, code compilation, data backup, and job submission. Login nodes should *not* be used for memory- or compute-intensive tasks. Users should also limit the number of simultaneous tasks performed on the login resources. For example, a user should not run 10 simultaneous ``tar`` processes on a login node or specify a large number to the ``-j`` parallel make option.

.. note::
   Users should not use ``make -j`` without supplying an argument to the ``-j`` option. If ``-j`` is specified without an argument, make will launch a number of tasks equal to the number of cores on the login node. This will adversely affect all users on the node.

Compute nodes are the appropriate location for large, long-running, computationally-intensive jobs. Compute nodes are requested via the Slurm batch scheduler, as described below.

.. warning::
   Compute-intensive, memory-intensive, or otherwise disruptive processes running on login nodes may be killed without warning.

Slurm
=====

The HPC11 resources use the Slurm batch scheduler. Previously, the HPC10 resource used the LSF scheduler. While there are similarities between different scheduling systems, the commands differ. The table below provides a comparison of useful/typical commands for each scheduler.

.. list-table::
   :header-rows: 1

   * - Task
     - LSF (HPC10)
     - Slurm (HPC11)
   * - View batch queue
     - ``bjobs``
     - ``squeue``
   * - Submit batch script
     - ``bsub``
     - ``sbatch``
   * - Submit interactive batch job
     - ``bsub -Is $SHELL``
     - ``salloc``
   * - Run parallel code within batch job
     - ``mpirun``
     - ``srun``
   * - Abort a queued or running job
     - ``bkill``
     - ``scancel``
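For example, the day-to-day Slurm equivalents of common LSF tasks look like the following (the job ID shown is illustrative):

.. code::

   $ sbatch test.slurm       # submit a batch script
   $ squeue -u $USER         # view your queued and running jobs
   $ scancel 123456          # cancel a job using the job ID reported by sbatch/squeue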
Node Exclusivity
----------------

The scheduler on HPC11 uses a non-exclusive node policy by default. This means that, resources permitting, the system is free to place multiple jobs per node; however, the system will not place jobs from multiple users on any node. Nodes will only be shared among jobs from one user. In practice, this scheduling policy permits more efficient use of system resources by giving the scheduler the ability to "pack" several small jobs from a given user onto a single node instead of requiring each to run on a separate node.

There are several caveats to non-exclusive node assignment. By default, the system will allocate 2GB of memory per core. This can be modified with the ``--mem-per-cpu`` flag; however, there is a maximum of 4GB/core. If you request a ``--mem-per-cpu`` value larger than that, the system will allocate an additional core for each additional 4GB memory block (or fraction thereof) that you request. For example, if you request ``--mem-per-cpu=10G``, the system will allocate 3 cores even if you've only requested 1.

Should you want exclusive node assignment, you need to specify the ``--exclusive`` Slurm option either on your ``sbatch``/``salloc`` command line or within your Slurm batch script. Additionally, you should request ``--mem=0`` to guarantee that the system makes all memory on each node available to your job.

Writing Batch Scripts
---------------------

Batch scripts, or job submission scripts, are the mechanism by which a user configures and submits a job for execution. A batch script is simply a shell script that also includes commands to be interpreted by the batch scheduling software (e.g. Slurm). Batch scripts are submitted to the batch scheduler, where they are then parsed for the scheduling configuration options. The batch scheduler then places the script in the appropriate queue, where it is designated as a batch job. Once the batch job makes its way through the queue, the script will be executed on the primary compute node of the allocated resources.

Batch scripts are submitted for execution using the ``sbatch`` command. For example, the following will submit the batch script named ``test.slurm``:

.. code::

   sbatch test.slurm

If successfully submitted, a Slurm job ID will be returned. This ID can be used to track the job. It is also helpful in troubleshooting a failed job; make a note of the job ID for each of your jobs in case you must contact the OLCF User Assistance Center for support.

Components of a Batch Script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Batch scripts are parsed into the following three sections:

* | **Interpreter Line**
  | The first line of a script can be used to specify the script's interpreter; this line is optional. If not used, the submitter's default shell will be used. The line uses the *hash-bang* syntax, i.e., ``#!/path/to/shell``.

* | **Slurm Submission Options**
  | The Slurm submission options are preceded by the string ``#SBATCH``, making them appear as comments to a shell. Slurm will look for ``#SBATCH`` options in a batch script from the script's first line through the first non-comment, non-whitespace line. (A comment line begins with ``#``.) ``#SBATCH`` options entered after the first non-comment line will not be read by Slurm.

* | **Shell Commands**
  | The shell commands follow the last ``#SBATCH`` option and represent the executable content of the batch job. If any ``#SBATCH`` lines follow executable statements, they will be treated as comments only.
  |
  | The execution section of a script will be interpreted by a shell and can contain multiple lines of executables, shell commands, and comments. When the job's queue wait time is finished, commands within this section will be executed on the primary compute node of the job's allocated resources. Under normal circumstances, the batch job will exit the queue after the last line of the script is executed.
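As a short illustration of these parsing rules (the option values are arbitrary), in the sketch below the final ``#SBATCH`` line is ignored because it appears after the first non-comment shell command:

.. code-block:: bash

   #!/bin/bash
   #SBATCH -A ABC123       # read by Slurm
   #SBATCH -N 1            # read by Slurm
   date                    # first shell command; Slurm stops scanning for options here
   #SBATCH -t 1:00:00      # NOT read by Slurm; treated as an ordinary comment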
Example Batch Scripts
^^^^^^^^^^^^^^^^^^^^^

Using Non-Exclusive Nodes (default)
"""""""""""""""""""""""""""""""""""

.. code-block:: bash
   :linenos:

   #!/bin/bash
   #SBATCH -A ABC123
   #SBATCH -J test
   #SBATCH -N 1
   #SBATCH --ntasks-per-node=4
   #SBATCH -t 1:00:00
   #SBATCH --mem-per-cpu=3G
   #SBATCH --cluster-constraint=green

   cd $SLURM_SUBMIT_DIR
   date
   srun -N1 -n 4 ./a.out

This batch script shows examples of the three sections outlined above:

**Interpreter Line**

.. list-table::

   * - 1
     - ``#!/bin/bash``
     - This line is optional. It is used to specify a shell to interpret the script. In this example, the Bash shell will be used.

**Slurm Options**

.. list-table::

   * - 2
     - ``#SBATCH -A ABC123``
     - The job will be charged to the "ABC123" project.
   * - 3
     - ``#SBATCH -J test``
     - The job will be named test.
   * - 4
     - ``#SBATCH -N 1``
     - The job will request 1 node.
   * - 5
     - ``#SBATCH --ntasks-per-node=4``
     - The job will be allocated resources to support 4 tasks per node.
   * - 6
     - ``#SBATCH -t 1:00:00``
     - The job will request a walltime of 1 hour.
   * - 7
     - ``#SBATCH --mem-per-cpu=3G``
     - Each core will be allocated 3GB of memory.
   * - 8
     - ``#SBATCH --cluster-constraint=green``
     - The job will run on a cluster with the 'green' label.

**Shell Commands**

.. list-table::

   * - 9
     - (blank line)
     - This line is left blank, so it will be ignored.
   * - 10
     - ``cd $SLURM_SUBMIT_DIR``
     - This command will change the current directory to the directory from which the script was submitted.
   * - 11
     - ``date``
     - This command will run the date command.
   * - 12
     - ``srun -N1 -n 4 ./a.out``
     - This command will run 4 MPI instances of the executable a.out on the compute node allocated by the batch system.
Using Exclusive Nodes
"""""""""""""""""""""

.. code-block:: bash
   :linenos:

   #!/bin/bash
   #SBATCH -A ABC123
   #SBATCH -J test
   #SBATCH -N 2
   #SBATCH -t 1:00:00
   #SBATCH --exclusive
   #SBATCH --mem=0
   #SBATCH --cluster-constraint=green

   cd $SLURM_SUBMIT_DIR
   date
   srun -N2 -n 16 ./a.out

This batch script shows examples of the three sections outlined above:

**Interpreter Line**

.. list-table::

   * - 1
     - ``#!/bin/bash``
     - This line is optional. It is used to specify a shell to interpret the script. In this example, the Bash shell will be used.

**Slurm Options**

.. list-table::

   * - 2
     - ``#SBATCH -A ABC123``
     - The job will be charged to the "ABC123" project.
   * - 3
     - ``#SBATCH -J test``
     - The job will be named test.
   * - 4
     - ``#SBATCH -N 2``
     - The job will request 2 nodes.
   * - 5
     - ``#SBATCH -t 1:00:00``
     - The job will request a walltime of 1 hour.
   * - 6
     - ``#SBATCH --exclusive``
     - The job will run in exclusive mode (no other jobs will be placed on any nodes allocated to this job).
   * - 7
     - ``#SBATCH --mem=0``
     - All memory on the node will be made available to the job.
   * - 8
     - ``#SBATCH --cluster-constraint=green``
     - The job will run on a cluster with the 'green' label.

**Shell Commands**

.. list-table::

   * - 9
     - (blank line)
     - This line is left blank, so it will be ignored.
   * - 10
     - ``cd $SLURM_SUBMIT_DIR``
     - This command will change the current directory to the directory from which the script was submitted.
   * - 11
     - ``date``
     - This command will run the date command.
   * - 12
     - ``srun -N2 -n 16 ./a.out``
     - This command will run 16 MPI instances of the executable a.out, spread out across the 2 nodes allocated by the system. By default this will be 8 tasks per node.
Interactive Batch Jobs
^^^^^^^^^^^^^^^^^^^^^^

Batch scripts are useful when one has a pre-determined group of commands to execute, the results of which can be viewed at a later time. However, it is often necessary to run tasks on compute resources interactively. Users are not allowed to access compute nodes directly from a login node. Instead, users must use an *interactive batch job* to allocate and gain access to compute resources. This is done by using the Slurm ``salloc`` command. The ``salloc`` command accepts many of the same arguments that would be provided in a batch script. For example, the following command requests an interactive allocation on a cluster with the "green" label (``--cluster-constraint=green``) to be charged to project ABC123 (``-A ABC123``), using 4 nodes (``-N 4``) in exclusive mode (``--exclusive``), with all memory on the nodes made available to the job (``--mem=0``), and with a maximum walltime of 1 hour (``-t 1:00:00``):

.. code::

   $ salloc -A ABC123 -N 4 --exclusive --mem=0 -t 1:00:00 --cluster-constraint=green

While ``salloc`` does provide interactive access, it does not necessarily do so immediately. The job must still wait for resources to become available in the same way as any other batch job. Once resources become available, the job will start and the user will be given an interactive prompt on the primary compute node within the set of nodes allocated to the job. Commands may then be executed directly on the command line (rather than through a batch script). To run a parallel application across the set of allocated compute nodes, use the ``srun`` command just as you would in a batch script.

Debugging
"""""""""

A common use of interactive batch jobs is to aid in debugging efforts. Interactive access to compute resources allows the ability to run a process to the point of failure; however, unlike a batch job, the process can be restarted after brief changes are made without losing the compute resource pool, thus speeding up the debugging effort.

Choosing a Job Size
"""""""""""""""""""

Because interactive jobs must sit in the queue until enough resources become available to allocate, it is useful to know when a job can start. Use the ``sbatch --test-only`` command to see when a job of a specific size could be scheduled. For example, the snapshot below shows that a 2 node job in exclusive mode would start at 10:54.

.. code::

   $ sbatch --test-only --cluster-constraint=green -N2 --exclusive --mem=0 -t1:00:00 batch-script.slurm

   sbatch: Job 1375 to start at 2019-08-06T10:54:01 using 512 processors on nodes node[0100,0121] in partition batch

.. note::
   The queue is fluid; the given time is an estimate made from the current queue state and load. Future job submissions and job completions will alter the estimate.

.. _common-batch-options:

Common Batch Options
^^^^^^^^^^^^^^^^^^^^

The following table summarizes frequently-used options for ``sbatch`` and ``salloc``. When using ``salloc``, options are specified on the command line. When using ``sbatch``, options may be specified on the command line or in the batch script file. If they are in the batch script file, they must be preceded by ``#SBATCH`` as described above.
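For example, the following two submissions are equivalent (if an option appears in both places, the value given on the command line is the one Slurm uses). The command

.. code::

   $ sbatch -A ABC123 -N 2 -t 1:00:00 --cluster-constraint=green test.slurm

is the same as running ``sbatch test.slurm`` with a ``test.slurm`` that begins:

.. code-block:: bash

   #!/bin/bash
   #SBATCH -A ABC123
   #SBATCH -N 2
   #SBATCH -t 1:00:00
   #SBATCH --cluster-constraint=green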
See the ``salloc`` and ``sbatch`` man pages for a complete description of each option as well as the other options available.

.. list-table::
   :header-rows: 1

   * - Option
     - Use
     - Description
   * - ``-A``
     - ``#SBATCH -A <account>``
     - Specify the account/project to which the job should be charged. The account string, e.g. ``ABC123``, is typically composed of three letters followed by three digits and optionally followed by a subproject identifier. You can view your assigned projects via the myOLCF User Portal.
   * - ``-N``
     - ``#SBATCH -N <number_of_nodes>``
     - Number of compute nodes to allocate. Jobs will be allocated 'whole' (i.e. dedicated/non-shared) nodes unless running in a "shared" partition. See the HPC11-specific documentation provided to AFLCMC for more information on HPC11 partitions.
   * - ``-t``
     - ``#SBATCH -t