Beginner’s guide to Slurm | Center for High Performance Computing

The CHPC uses Slurm to manage resource scheduling and job submission. Users submit jobs on the login node. The queueing system, also known as the job scheduler, will determine when and where to run your jobs. Slurm will factor in the computational requirements of the job, including (but not limited to) the number of CPUs, the number of GPUs, CPU memory, GPU memory, and walltime (the requested total running time for the job).

A simple batch script

A batch script is a convenient tool for telling the job scheduler how to process your jobs. It can be thought of as an extension of a shell script.

A typical batch script consists of two parts: the first part declares the requested computational resources for your job (e.g. the number of CPU cores, the amount of walltime, etc.), and the second part lists the commands to execute for your job.

The part declaring computational resources follows the syntax:

#SBATCH -OPTION ARGUMENT
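
For example, a directive requesting a single compute node looks like this (each of the options used in this guide is explained in the sections below):

#SBATCH --nodes 1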

CPU request

The maximum number of CPUs you can request per node is 30.

For all your jobs

The CHPC has multiple compute nodes (i.e. servers), and each compute node has multiple CPUs. To request CPUs, you first specify how many compute nodes your job needs, and then how many CPU cores you need on each compute node.

The number of compute nodes is specified by the option -N, or --nodes. For starters, we strongly recommend requesting a single node for your jobs, that is, specifying #SBATCH -N 1 or #SBATCH --nodes 1.

The number of CPU cores within a compute node is specified by the option -n, or --ntasks. For starters, we strongly recommend requesting a single task for your jobs, that is, specifying #SBATCH -n 1 or #SBATCH --ntasks 1.

If your application automatically employs multiple threads, or you know that it can run across multiple CPU cores within a single node (a multi-threaded or OpenMP application), you can use --cpus-per-task in your batch script to request more than one CPU core, up to 32 (the maximum number of cores in a single compute node). See details on how to run multi-threaded jobs in the advanced instructions.
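
As a minimal sketch of such a request (the core count of 8 and the program name are placeholders, not site recommendations), a multi-threaded job on a single node could look like this inside the batch script:

#SBATCH -N 1
#SBATCH -n 1
#SBATCH --cpus-per-task 8

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK    ######## match threads to requested cores  ########
./my_openmp_program                            ######## placeholder for your application  ########

Setting OMP_NUM_THREADS from the Slurm-provided SLURM_CPUS_PER_TASK variable keeps the thread count in step with the allocation, so the job does not oversubscribe its cores.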

If you are unfamiliar with the application, you might begin with a test run requesting 1 CPU core, then re-run the same job with 2 CPU cores. If there is no difference in walltime, the application is most likely serial. If the walltime decreases, run the same job with 4 and 8 CPU cores to see whether the application continues to scale. If it scales well, the walltime should drop roughly in proportion to the number of CPU cores added (e.g. doubling the cores should roughly halve the walltime).

CPU memory request

The option for specifying CPU memory in Slurm is CPU memory per node, --mem. Most jobs request only a single compute node, so this is equivalent to the total CPU memory for your job. This parameter is typically given in units of megabytes or gigabytes, for instance, #SBATCH --mem 300M or #SBATCH --mem 20G.

GPU request

The CHPC has multiple GPU nodes, and these nodes differ in the types and numbers of GPUs they carry. To request GPUs, you first specify how many GPU nodes your job needs, and then how many GPUs you need on each GPU node.

The number of GPU nodes is specified by the option -N, or --nodes, exactly as for CPU nodes. Since most GPU applications cannot run across multiple GPU nodes, we strongly recommend requesting a single GPU node for your jobs, that is, specifying #SBATCH -N 1 or #SBATCH --nodes 1.

The number of GPUs within a GPU node is specified by the option --gres. The simplest argument for --gres is gpu:X, where X is the number of GPUs within a single GPU node; the maximum value for X is 4. To request 1 GPU for your job, specify #SBATCH --gres gpu:1. You can also request a particular type of GPU for your job; check here.
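
For instance, the following alternative requests ask for two GPUs on a single node, or for one GPU of a specific type; the type name v100 is only illustrative, since the available type names depend on how the GPU nodes are configured:

#SBATCH --gres gpu:2
#SBATCH --gres gpu:v100:1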

GPU memory request

If you know how much GPU memory your job will use and prefer to request a particular type of GPU that meets this need, you can specify GPU memory using the option --gres with the vmem argument. This is an advanced topic and the relevant information can be found here.

Walltime request

For all jobs, you must explicitly specify the walltime. Otherwise your job is assigned the default QOS, test_1_4h, regardless of how many CPU cores it requests, which may leave it pending forever because of a conflict between the job's requirements and the partition criteria.

The option for specifying the walltime of your job is -t, or --time. The argument format is DD-HH:MM:SS. An example requesting 36 hours, 40 minutes and 50 seconds of walltime is shown below; both forms are equivalent.

#SBATCH -t 1-12:40:50
#SBATCH --time 36:40:50

Batch script example

Putting it all together, an example batch script for a job requesting 1 CPU, 1 GPU, 200 MB of CPU memory and a walltime of 10 minutes is shown below:

#!/bin/bash

#SBATCH -J Sample_Job      ########    Job Name: Sample_Job    ########
#SBATCH -N 1               ########     Number of nodes: 1     ########
#SBATCH -n 1               ########     Number of tasks: 1     ########
#SBATCH --gres gpu:1       ######## Number of gpus per node: 1 ########
#SBATCH --mem 200M         ########   Memory per node: 200 MB  ########
#SBATCH -t 00:10:00        ########    Walltime: 10 minutes    ########

hostname
sleep 300

The options specified with #SBATCH can also be passed on the srun command line, but it is more convenient to include them directly in the batch script. More detailed information on the available options for a batch script can be found in the advanced instructions.

Note: when writing your own batch script, always place the #SBATCH resource specifications before any of your commands, as shown in the example batch script above. Any #SBATCH options that appear after the first command are silently ignored by Slurm.
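
For example, in the sketch below the node request is honored, but the memory request is silently ignored because it appears after the first command:

#!/bin/bash

#SBATCH -N 1               ########  processed: appears before any command  ########
hostname
#SBATCH --mem 200M         ######## ignored: appears after the first command ########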

Basic commands

Submit batch script

Once your batch script is ready, you simply submit it to the queueing system, which then handles and runs your job automatically.

[xinghuang@login01 ~]$ sbatch MY_BATCH_SCRIPT

If the command completes without any error message, it returns a job ID for the submitted job (see the example below).

[xinghuang@login01 ~]$ sbatch runjob
Submitted batch job 132879

Check the status of submitted jobs

You can run the squeue command to check the status of your submitted jobs: either squeue on its own, which lists all jobs, or squeue -u your_username, which lists only yours.

[xinghuang@login01 ~]$ squeue -u xinghuang
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            173080     small     Job1    xingh CG   02:44:35      1 node08
            173081       gpu     Job2    xingh  R      18:27      1 gpu05
            173082    medium     Job3    xingh PD       0:00      1 (Priority)

The most common job statuses you will see in the ST column are R (Running), CG (Completing) and PD (Pending). Additional job statuses and explanations can be found here.

NOTE

  • When checking your job status, you will notice a column named PARTITION. Partition is Slurm's term for the different job queues defined in the job scheduler. You generally don't need to worry about the partition, as the job scheduler automatically assigns the proper partition to your job based on the computational resources it requests. You do, however, need to specify the partition explicitly if you need to run jobs on the high-memory node. For more detailed information on partitions, check here.
  • You may also want to pay attention to the last column of the squeue output, especially for pending jobs: the NODELIST (REASON) column gives a hint as to why a job is still queued. If it shows “ReqNodeNotAvail, May be reserved for other job”, your job may need to run on reserved nodes, which are only available within a limited time frame. If it shows “launch failed requeued held”, check your batch script for syntax errors or for unreasonable resource requests. If it shows “DependencyNeverSatisfied”, a prerequisite job has likely failed, so the dependent job will never run. You don't need to worry much if other reasons are shown. A command for listing only your pending jobs is shown after this list.
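
For example, to list only your pending jobs together with their queueing reasons, you can filter squeue by job state (standard squeue behavior; substitute your own username):

[xinghuang@login01 ~]$ squeue -u xinghuang -t PENDING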

Cancel/delete submitted/running jobs

After running the squeue command to check the status of your submitted jobs, you can find the job ID assigned to each job by the scheduler in the JOBID column. To cancel or delete a submitted or running job, use the command scancel JOBID.

[xinghuang@login01 ~]$ scancel 173082
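
If you need to cancel all of your own jobs at once, scancel also accepts a username instead of a job ID (standard Slurm behavior; substitute your own username):

[xinghuang@login01 ~]$ scancel -u xinghuang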

Interactive job

To request an interactive job on a CPU node, run a command like:

[xinghuang@login01 ~]$ srun -N 1 -n 8 --mem 400M --time 00:20:00 --pty bash
[xinghuang@node17 ~]$

To request an interactive job on a GPU node, run a command like:

[xinghuang@login01 ~]$ srun --gres gpu:1 -n 4 --mem 1G --time 01:30:00 --pty bash
[xinghuang@gpu01 ~]$

To request an interactive job that uses multiple nodes with both CPUs and GPUs, run a command like:

[xinghuang@login01 ~]$ srun -N 2 -n 8 --gres gpu:2 --time 1:00:00 --mem 1G --pty bash
[xinghuang@gpu01 ~]$

To request an interactive job with X11 forwarding, run srun with the --x11 flag:

[xinghuang@login01 ~]$ srun -N 1 -n 1 --mem 40M --time 00:10:00 --x11 --pty bash
[xinghuang@node17 ~]$
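
To end any of these interactive sessions and release the allocated resources, simply exit the shell; you are returned to the login node:

[xinghuang@node17 ~]$ exit
[xinghuang@login01 ~]$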

Troubleshooting

After your job completes, Slurm writes a standard output file and an error file in the directory from which you submitted the job. By default these files are named JOBNAME.oJOBID and JOBNAME.eJOBID, and they are helpful for understanding how your job performed.
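
If you prefer different names or locations for these files, you can set them explicitly in your batch script with the standard -o and -e options; the file names below are only illustrative, and %j expands to the job ID:

#SBATCH -o my_job.%j.out
#SBATCH -e my_job.%j.err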

A typical standard output file is composed of three parts: output from the prologue at the top, output from your job in the middle, and output from the epilogue at the bottom, which reports on your job's performance.

--------------------------------------------------------------
Begin Slurm Prologue Tue Oct 19 12:12:50 CDT 2021 1634663570
Job ID:         521490
Username:       xinghuang
Partition:      small
End Slurm Prologue Tue Oct 19 12:12:50 CDT 2021 1634663570
--------------------------------------------------------------
time = 0.000
time = 0.100
time = 0.200
time = 0.300
time = 0.400
time = 0.500
time = 0.600
time = 0.700
time = 0.800
time = 0.900
time = 1.000
--------------------------------------------------------------
Begin Slurm Epilogue Tue Oct 19 12:13:02 CDT 2021 1634663582
Name                : Sample_Job
User                : xinghuang
Partition           : small
Nodes               : node24
Cores               : 4
State               : COMPLETED
Submit              : 2021-10-19T12:12:49
Start               : 2021-10-19T12:12:50
End                 : 2021-10-19T12:12:57
Reserved Walltime   : 00:30:00
Used Walltime       : 00:00:07
Used CPU Time       : 00:00:24
% User (Computation): 99.54%
% System (I/O)      :  0.00%
Mem Reserved        : 4G/node
Max Mem Used        : 1.25M (1310720.0)
Max Disk Write      : 0.00  (0.0)
Max Disk Read       : 0.00  (0.0)
Max-Mem-Used Node   : node24
Max-Disk-Write Node : node24
Max-Disk-Read Node  : node24
End Slurm Epilogue Tue Oct 19 12:13:02 CDT 2021 1634663582
--------------------------------------------------------------

Used Walltime shows exactly how long your job took from start to finish, which gives you an idea of the optimal walltime to request for this kind of job.

Mem Reserved and Max Mem Used can be compared to see whether the requested CPU memory is appropriate for your job. To avoid wasting allocated CPU memory, the requested amount should not be far above the maximum memory actually used.

If your job needs more time than the requested walltime, or more CPU memory than the requested CPU memory, it will be killed for exceeding its resource request. Walltime and CPU memory are therefore the two most basic parameters to check when debugging failed jobs.
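
As a rough sketch based on the epilogue above (Used Walltime of 7 seconds and Max Mem Used of 1.25M against 4G reserved), a resubmission of that job could safely trim its requests, for example:

#SBATCH -t 00:05:00        ######## walltime with generous headroom ########
#SBATCH --mem 100M         ######## memory with generous headroom  ########

The exact margins are a judgment call: leave enough headroom that a slightly heavier run is not killed, but not so much that allocated resources sit idle.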
