Now that you are connected to the head node, familiarize yourself with the cluster structure by running the following set of commands.
SLURM
SLURM from SchedMD is one of the batch schedulers that you can use in AWS ParallelCluster. For an overview of the SLURM commands, see the SLURM Quick Start User Guide.
- List existing partitions and nodes per partition. Running
sinfo
shows both the instances currently running and those that are not (think of this as a queue limit). Initially we’ll see all the nodes in state idle~, meaning no instances are running. When we submit a job, we’ll see some instances go into state alloc, meaning they’re completely allocated, or mix, meaning some but not all of their cores are allocated. After the job completes, the instance stays around for a few minutes (the default cooldown is 10 mins) in state idle%. This can be confusing, so we’ve summarized the states in the table below:
State | Description |
---|---|
idle~ | Instance is not running but can launch when a job is submitted. |
idle% | Instance is running and will shut down after ScaledownIdletime (default 10 mins). |
mix | Instance is partially allocated. |
alloc | Instance is completely allocated. |
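As a quick sketch of how these states show up in practice, here is a hypothetical sinfo snapshot tallied by state with awk (the node names and counts are made up; on a real cluster you would pipe the output of sinfo directly):

```shell
# Hypothetical sinfo output (columns: PARTITION AVAIL TIMELIMIT NODES STATE NODELIST)
sample='compute* up infinite 8 idle~ compute-dy-c5n18xlarge-[1-8]
compute* up infinite 2 alloc compute-dy-c5n18xlarge-[9-10]'

# Tally the number of nodes in each state (column 4 = NODES, column 5 = STATE)
printf '%s\n' "$sample" | awk '{count[$5] += $4} END {for (s in count) print s, count[s]}'
```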
- List jobs in the queues or running. Obviously, there won’t be any since we did not submit anything…yet!
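The standard SLURM command for listing jobs is squeue, shown here as a sketch (it produces output only on a live cluster with jobs submitted):

```shell
# List all queued and running jobs (empty until we submit something)
squeue
# Limit the listing to your own jobs:
squeue -u "$USER"
```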
Module Environment
Environment Modules are a fairly standard tool in HPC used to dynamically modify your environment variables (PATH, LD_LIBRARY_PATH, etc.).
- List available modules. You’ll notice that every cluster comes with intelmpi and openmpi pre-installed. These MPI versions are compiled with support for EFA, the high-speed interconnect.
- Load a particular module. In this case, these commands load IntelMPI into your environment and check the version of mpirun.
module load intelmpi
mpirun -V
Shared Filesystems
- List mounted NFS volumes. A few volumes are shared by the head node and will be mounted on compute instances when they boot up. Both /shared and /home are accessible by all nodes.
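One way to do this is to filter the mount table by filesystem type (a sketch; nfs4 is the type used on Amazon Linux):

```shell
# Show only NFS mounts, with human-readable sizes
df -h -t nfs4
# The raw mount table can be filtered the same way:
mount -t nfs4
```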
- List shared filesystems. When we created the cluster, we also created a Lustre filesystem with FSx for Lustre. We can see where it was mounted and its size by running:
You’ll see a line like:
172.31.21.202@tcp:/zm5lzbmv 1.1T 1.2G 1.1T 1% /shared
This is a 1.2 TB filesystem (df reports it in binary units as 1.1T) mounted at /shared that’s 1% used.
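A sketch of commands that produce output like the line above (df -h is assumed; the Lustre client tools also provide lfs for a per-target breakdown):

```shell
# Size and usage of the Lustre mount at /shared
df -h /shared
# More detailed, per-OST view via the Lustre client tools:
lfs df -h /shared
```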
In the next section we’ll install Spack on this shared filesystem!