Hi all,
I'm facing the following issue with a DGX A100 machine: I can allocate resources, but the job fails when I try to execute srun. Here is a detailed analysis of the incident:
```
$ salloc -n1 -N1 -p DEBUG -w dgx001 --time=2:0:0
salloc: Granted job allocation 1278
salloc: Waiting for resource configuration
salloc: Nodes dgx001 are ready for job
$ srun hostname
srun: error: slurm_receive_msgs: [[dgx001.hpc]:6818] failed: Socket timed out on send/recv operation
srun: error: Task launch for StepId=1278.0 failed on node dgx001: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted
```
The DGX Slurm daemon version is:
```
$ slurmd -V
slurm 22.05.8
```
```
$ uname -a
Linux dgx001.hpc 5.4.0-137-generic #154-Ubuntu SMP Thu Jan 5 17:03:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.5 LTS
Release: 20.04
Codename: focal
```
With cgroup/v2 enabled as follows:
```
$ cat /etc/default/grub | grep cgroup
GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1 cgroup_enable=memory swapaccount=1"
```
The daemon status, even though cgroup/v2 is used, still shows the `slurmstepd` process inside `slurmd.service` (the process 2250748 `slurmstepd` does not appear under the slurmd service on the other machines):
```
$ systemctl status slurmd
● slurmd.service – Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/slurmd.service.d
└─override.conf
Active: active (running) since Fri 2023-02-10 14:14:21 CET; 20min ago
Main PID: 2250012 (slurmd)
Tasks: 5
Memory: 10.9M
CPU: 105ms
CGroup: /system.slice/slurmd.service
├─2250012 /usr/local/sbin/slurmd -D -s -f /var/spool/slurm/d/conf-cache/slurm.conf -vvvvvv
└─2250748 /usr/local/sbin/slurmstepd
“`
The expected job is also spawned in `slurmstepd.scope`:
```
$ systemctl status slurmstepd.scope
● slurmstepd.scope
Loaded: loaded (/run/systemd/transient/slurmstepd.scope; transient)
Transient: yes
Active: active (abandoned) since Fri 2023-02-10 14:14:21 CET; 22min ago
Tasks: 5
Memory: 1.4M
CPU: 28ms
CGroup: /system.slice/slurmstepd.scope
├─job_1278
│ └─step_extern
│ ├─slurm
│ │ └─2250609 slurmstepd: [1278.extern]
│ └─user
│ └─task_special
│ └─2250619 sleep 100000000
└─system
└─2250024 /usr/local/sbin/slurmstepd infinity
feb 10 14:14:21 dgx001.hpc systemd[1]: Started slurmstepd.scope.
```
The slurm.conf file is tested and works without problems on other machines. Here is the slurmd service output:
```
$ journalctl -u slurmd
feb 10 14:14:57 dgx001.hpc slurmd[2250012]: slurmd: debug: Waiting for job 1278's prolog to complete
feb 10 14:14:57 dgx001.hpc slurmd[2250012]: slurmd: debug: Finished wait for job 1278's prolog to complete
feb 10 14:14:57 dgx001.hpc slurmd[2250012]: slurmd: debug3: _rpc_launch_tasks: call to _forkexec_slurmstepd
feb 10 14:14:57 dgx001.hpc slurmd[2250012]: slurmd: debug3: slurmstepd rank 0 (dgx001), parent rank -1 (NONE), children 0, depth 0, max_depth 0
feb 10 14:14:57 dgx001.hpc slurmd[2250012]: slurmd: debug3: PLUGIN IDX
feb 10 14:14:57 dgx001.hpc slurmd[2250012]: slurmd: debug3: MPI CONF SEND
feb 10 14:14:57 dgx001.hpc slurmd[2250012]: slurmd: error: _send_slurmstepd_init failed
feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug3: in the service_connection
feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug2: Start processing RPC: REQUEST_TERMINATE_JOB
feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug: _rpc_terminate_job: uid = 3000 JobId=1278
feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug3: state for jobid 998: ctime:1675770987 revoked:0 expires:2147483647
feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug3: state for jobid 1184: ctime:1675953198 revoked:0 expires:2147483647
feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug3: state for jobid 1217: ctime:1675967394 revoked:0 expires:2147483647
feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug3: state for jobid 1278: ctime:1676034890 revoked:0 expires:2147483647
feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug: credential for job 1278 revoked
feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug: sent SUCCESS, waiting for step to start
feb 10 14:16:31 dgx001.hpc slurmd[2250012]: slurmd: debug: Blocked waiting for JobId=1278, all steps
```
The function that fails is `_send_slurmstepd_init` at `req.c:634`:
```
if (mpi_conf_send_stepd(fd, job->mpi_plugin_id) !=
    SLURM_SUCCESS) {
	debug3("MPI CONF SEND");
	goto rwfail;
}
```
`mpi_conf_send_stepd` fails at `slurm_mpi.c:635`:
```
if ((index = _plugin_idx(plugin_id)) < 0) {
	debug3("PLUGIN IDX");
	goto rwfail;
}
```
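So the lookup of the job's `mpi_plugin_id` in slurmd's table of loaded MPI plugins returns a negative index, meaning the plugin the controller recorded for the job is not loaded on this node, and `mpi_conf_send_stepd` then bails out before the stepd is initialized. A minimal sketch of what such an index lookup does (my simplified model, not Slurm's actual code; the table contents are invented):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the loaded-MPI-plugin table in slurmd;
 * the ids here are made up for illustration. */
static const uint32_t loaded_plugin_ids[] = { 101, 102 };
#define N_LOADED (sizeof(loaded_plugin_ids) / sizeof(loaded_plugin_ids[0]))

/* Linear scan: return the table index of plugin_id, or -1 if the
 * plugin is not loaded. A -1 here corresponds to the "PLUGIN IDX"
 * debug line in the trace above, which in turn makes
 * _send_slurmstepd_init fail. */
static int plugin_idx(uint32_t plugin_id)
{
	for (size_t i = 0; i < N_LOADED; i++)
		if (loaded_plugin_ids[i] == plugin_id)
			return (int)i;
	return -1;
}
```

If this is indeed the failure, it points at a mismatch between the MPI plugins known to the controller and those actually available to slurmd on this node.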
Configure settings:
```
./configure --prefix=/usr/local --libdir=/usr/lib64 --enable-pam \
    --enable-really-no-cray --enable-shared --enable-x11 --disable-static \
    --disable-salloc-background --disable-partial_attach \
    --with-oneapi=no --with-shared-libslurm --without-rpath --with-munge \
    --enable-developer
```
I'm sorry for the hyper-detailed mail, but I have no idea how to cope with the issue, so I hope all these details will be useful to solve it.
Thanks in advance,
Niccolo