Cluster Information – SLING user documentation

Slurm offers a range of commands for interacting with the cluster. In this section, we will explore some examples of using the sinfo, squeue, scontrol, and sacct commands, which provide valuable insights into the cluster’s configuration and status. For comprehensive information on all the commands supported by Slurm, please refer to the Slurm project website.

Command: sinfo

The sinfo command displays information about the cluster's state, partitions (subdivisions of the cluster), nodes, and available computing resources. Many options are available for selecting which information is shown. For more precise control over the output, see the sinfo documentation, which describes the available options and switches.

Display general information about the cluster configuration:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gridlong*    up 14-00:00:0      4 drain* nsc-gsv001,nsc-lou001,nsc-msv001,nsc-vfp002
gridlong*    up 14-00:00:0      3  down* nsc-fp003,nsc-gsv003,nsc-msv006
gridlong*    up 14-00:00:0      1  drain nsc-vfp001
gridlong*    up 14-00:00:0      3  alloc nsc-lou002,nsc-msv[003,018]
gridlong*    up 14-00:00:0      3   resv nsc-fp[005-006],nsc-msv002
gridlong*    up 14-00:00:0     24    mix nsc-fp[002,004,007-008],nsc-gsv[002,004-007],nsc-msv[004-005,007-017,019-020]
gridlong*    up 14-00:00:0      1   idle nsc-fp001
e7           up 14-00:00:0      2 drain* nsc-lou001,nsc-vfp002
e7           up 14-00:00:0      1  drain nsc-vfp001
e7           up 14-00:00:0      1  alloc nsc-lou002

In the above output, you can see the available logical partitions, their state, the time limit for jobs in each partition, and the lists of compute nodes associated with them. The output can be customized using appropriate options to display specific information based on your requirements.
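As a sketch of such customization, the helper below restricts the listing to idle and partially used (mix) nodes of one partition and selects the columns explicitly. The function name free_nodes is hypothetical and the default partition gridlong is taken from the sample output; --partition, --states and --format are standard sinfo options.

```shell
# Hypothetical helper: list idle and partially used (mix) nodes of a partition.
# The default partition name gridlong is taken from the sample output above.
free_nodes() {
    # %P = partition, %D = node count, %t = compact state, %N = node list
    sinfo --partition="${1:-gridlong}" --states=idle,mix --format="%P %D %t %N"
}
```

On a login node, calling free_nodes e7 would apply the same filter to the e7 partition.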

Display detailed information about compute nodes:

$ sinfo --Node --long
Tue Jan 05 11:06:02 2021
NODELIST    NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
nsc-fp001       1 gridlong*   allocated 16      2:8:1  64200        0   1000 intel,gp none
nsc-fp002       1 gridlong*   allocated 16      2:8:1  64200        0   1000 intel,gp none
nsc-fp003       1 gridlong*   allocated 16      2:8:1  64200        0   1000 intel,gp none
nsc-fp004       1 gridlong*   allocated 16      2:8:1  64200        0   1000 intel,gp none
nsc-fp005       1 gridlong*    reserved 16      2:8:1  64200        0   1000 intel,gp none
nsc-fp006       1 gridlong*    reserved 16      2:8:1  64200        0   1000 intel,gp none
nsc-fp007       1 gridlong*   allocated 16      2:8:1  64200        0   1000 intel,gp none
nsc-fp008       1 gridlong*   allocated 16      2:8:1  64200        0   1000 intel,gp none
nsc-gsv001      1 gridlong*    reserved 64     4:16:1 515970        0      1 AMD,bigm none
nsc-gsv002      1 gridlong*   allocated 64     4:16:1 515970        0      1 AMD,bigm none

The above output lists each compute node in the cluster with its partition (PARTITION), current state (STATE), number of CPUs (CPUS), number of processor sockets (S), number of cores per socket (C), number of hardware threads per core (T), amount of system memory in megabytes (MEMORY), and any assigned features (AVAIL_FEATURES, truncated to AVAIL_FE in the header), such as processor type or the presence of GPUs.
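In the sample output, the CPUS column equals the product of the three S:C:T values. The small helper below (a hypothetical name, not part of Slurm) makes that arithmetic explicit for the two node types shown.

```shell
# Hypothetical helper: multiply sockets, cores per socket and threads per core.
sct_to_cpus() {
    # Split the S:C:T value on ':' and print the product of the three fields.
    printf '%s\n' "$1" | awk -F: '{ print $1 * $2 * $3 }'
}

sct_to_cpus 2:8:1     # nsc-fp nodes:  prints 16
sct_to_cpus 4:16:1    # nsc-gsv nodes: prints 64
```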

Cluster partitions may be reserved in advance for various reasons such as maintenance, workshops, or specific projects. An example of displaying active reservations in the NSC cluster is as follows:
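A minimal sketch of such a query, assuming it is run on a cluster login node: the --reservation option of sinfo limits the output to reservations. The wrapper name show_reservations is hypothetical and only serves to keep the command reusable.

```shell
# Hypothetical wrapper: list the reservations known to Slurm.
show_reservations() {
    # --reservation makes sinfo print reservation information only
    sinfo --reservation
}
```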

The reservation listing shows any active reservations in the cluster, along with the duration of each reservation and the nodes it includes. Each reservation is associated with a group of users who have exclusive access to the reserved nodes, so their jobs do not have to wait behind jobs from users outside the reservation.

Command: squeue

In addition to the cluster configuration, we are usually interested in the state of the job queue. The squeue command (documentation) lists jobs that are waiting in the queue, running, or have recently finished, whether successfully or not.

Output of the current job queue status:
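Running squeue without options prints the whole queue. The sketch below spells out a format string that, to the best of our knowledge, matches the default column layout, so that each column can be related to its meaning. The function name queue_overview is hypothetical.

```shell
# Hypothetical wrapper: print the job queue with an explicit column layout.
queue_overview() {
    # %i job ID, %P partition, %j job name, %u user, %t compact state,
    # %M elapsed time, %D node count, %R node list or pending reason
    squeue --format="%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R"
}
```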

From the output, we can retrieve the identifier of each individual job, the partition on which it is running, the job name, the user who launched it, and the current job status.

Some of the important job states are:

  • PD (PenDing) – the job is waiting in the queue,
  • R (Running) – the job is running,
  • CG (CompletinG) – the job is completing,
  • CD (CompleteD) – the job has completed,
  • F (Failed) – there was an error during execution,
  • S (Suspended) – the job execution is temporarily suspended,
  • CA (CAnceled) – the job has been canceled,
  • TO (TimeOut) – the job has been terminated due to a time limit.

The output also provides information about the total job runtime and the list of nodes on which the job is running, or the reason why the job has not started yet.
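To get a quick feel for how the queue is distributed over the states listed above, the compact state codes can be counted. This is only a sketch: the function name jobs_per_state is hypothetical, and it relies on the standard --noheader option and the %t format field of squeue.

```shell
# Hypothetical helper: count the jobs in the queue per compact state code.
jobs_per_state() {
    # --noheader drops the column header; %t prints only the state code (PD, R, ...)
    squeue --noheader --format="%t" | sort | uniq -c
}
```

Each output line holds a count followed by a state code, for example one line for R and one for PD.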

We are usually most interested in the status of jobs that we have launched ourselves. We can limit the output to jobs of a specific user using the --user option.

Example output of jobs owned by user gen012:
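A minimal sketch of the corresponding call, wrapped in a hypothetical function jobs_of_user. The user name gen012 comes from the text above; when no name is given, the sketch falls back to the account running the command.

```shell
# Hypothetical wrapper: list the jobs of one user.
jobs_of_user() {
    # Default to the current account when no user name is passed.
    squeue --user="${1:-$(id -un)}"
}
```

Calling jobs_of_user gen012 corresponds to running squeue --user=gen012.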

In addition, we can also limit the output to jobs in a specific state. This can be done using the --states option.

Example output of all currently pending (PD) jobs:
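The state filter can be sketched in the same way. The function name jobs_in_state is hypothetical; the --states option accepts the state codes listed earlier, and PD is used as the default here to match the example.

```shell
# Hypothetical wrapper: list the jobs that are in a given state.
jobs_in_state() {
    # Accepts a comma-separated list of state codes, e.g. PD or PD,R
    squeue --states="${1:-PD}"
}
```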

Command: scontrol

Sometimes we require more detailed information about a specific partition, node, or job. This information can be obtained using the scontrol command (documentation). Below are some examples of how to use this command.

Example output of more detailed information about a specific partition:
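A sketch of the partition query, assuming the gridlong partition from the sinfo output earlier in this section. The wrapper name partition_info is hypothetical; scontrol show partition is the underlying Slurm command.

```shell
# Hypothetical wrapper: show the configuration of one partition.
partition_info() {
    # The default partition name gridlong is taken from the sinfo sample output.
    scontrol show partition "${1:-gridlong}"
}
```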

Example output of more detailed information about the compute node nsc-lou003:
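The node query follows the same pattern. In this sketch the wrapper name node_info is hypothetical and the node name is passed as an argument, for example nsc-lou003 as mentioned above.

```shell
# Hypothetical wrapper: show the details of one compute node.
node_info() {
    # $1 is the node name, e.g. nsc-lou003
    scontrol show node "$1"
}
```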

Example output of more detailed information about the job with ID 387489:
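For a job, the sketch below passes the job ID to scontrol show job. The wrapper name job_info is hypothetical, and the ID 387489 is the one used in the text; the query is expected to work only for jobs the controller still knows about, that is, pending, running, or recently finished ones.

```shell
# Hypothetical wrapper: show the details of one job by its ID.
job_info() {
    # $1 is the numeric job ID, e.g. 387489
    scontrol show job "$1"
}
```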

We can also check which users have permission to use reserved nodes:
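A minimal sketch, assuming it is run on a login node: scontrol show reservation prints one record per reservation, and the permitted users and accounts are expected to appear in its Users and Accounts fields. The wrapper name reservation_info is hypothetical.

```shell
# Hypothetical wrapper: show all reservations, or a single one when a name is given.
reservation_info() {
    if [ -n "$1" ]; then
        scontrol show reservation "$1"
    else
        scontrol show reservation
    fi
}
```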

Command: sacct

The sacct command reports accounting information about running and completed jobs.

For example, we can check the status of all jobs from the last day:
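One way to express "the last day" is to compute the start time with date and pass it to the --starttime option of sacct. The sketch below assumes GNU date (for the "1 day ago" expression); the function name jobs_last_day and the chosen columns are illustrative. By default sacct reports only the jobs of the user running it.

```shell
# Hypothetical helper: list accounting records of jobs from the last 24 hours.
jobs_last_day() {
    # Assumption: GNU date is available to evaluate "1 day ago".
    since="$(date -d '1 day ago' +%Y-%m-%dT%H:%M:%S)"
    sacct --starttime="$since" \
          --format=JobID,JobName,Partition,State,Elapsed,ExitCode
}
```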

We can also inquire about the details of a specific job:
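A sketch of a per-job query using the --jobs option of sacct. The function name job_accounting and the selected columns are illustrative choices; the job ID is passed as an argument, for example 387489 from the scontrol section.

```shell
# Hypothetical helper: show accounting details of a single job.
job_accounting() {
    # $1 is the numeric job ID; MaxRSS reports the peak memory use of a job step
    sacct --jobs="$1" --format=JobID,JobName,State,Elapsed,MaxRSS,ExitCode
}
```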
