I am having trouble launching a SLURM job that calls an mpirun subprocess from a Python script. Inside the Python script (let's call it script.py) I have this subprocess.run call:
import subprocess

def run_mpi(config_name, np, working_dir):
    data_path = working_dir + "/" + config_name
    subprocess.run(
        [
            "mpirun -np "
            + np
            + " "
            + working_dir
            + "/spk_mpi -echo log < "
            + data_path
            + "/in.potts"
        ],
        # mpirun -np 32 spk_mpi -echo log < /$PATH/in.potts
        check=True,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        stdout=subprocess.PIPE,
        shell=True,
    )
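Since both streams are captured with subprocess.PIPE, any error message from mpirun is swallowed unless it is explicitly printed. A minimal sketch that surfaces the captured stderr in the SLURM output file (the failing command here is just a stand-in for the real mpirun line):

```python
import subprocess

# Stand-in command that fails on purpose, so the except branch runs;
# in the real script this would be the mpirun command line.
cmd = "echo 'boom' >&2; exit 1"

try:
    subprocess.run(
        cmd,
        check=True,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
except subprocess.CalledProcessError as e:
    # e.stderr holds the captured error output; printing it makes it
    # land in the SLURM output file (slurm-<jobid>.out).
    print("subprocess failed with code", e.returncode)
    print("stderr:", e.stderr.strip())
```

With check=True plus captured pipes, an exception is the only place the child's error text survives, so logging it this way is the quickest path to a real diagnosis.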
I then execute the script by submitting a SLURM job to a cluster node with something like:
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=2-00:00:00       # Time limit days-hrs:min:sec
#SBATCH --partition=thin
python script.py --working_dir=$PATH --np=$SLURM_NTASKS
but somehow the subprocess is never executed. I also tried changing the subprocess call to shell=False, but then I get returned non-zero exit status 1 (I might be doing something wrong while parsing the command into an argument list).
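For completeness, a sketch of what I think the shell=False variant should look like. The "< in.potts" redirection is a shell feature, so with shell=False the input file has to be opened explicitly and passed as stdin, and each argument becomes its own list element:

```python
import subprocess

def run_mpi_no_shell(config_name, np, working_dir):
    data_path = working_dir + "/" + config_name
    # With shell=False, every argument is a separate list element,
    # and the shell redirection "< in.potts" becomes an explicit stdin.
    with open(data_path + "/in.potts") as infile:
        subprocess.run(
            ["mpirun", "-np", str(np), working_dir + "/spk_mpi", "-echo", "log"],
            stdin=infile,
            check=True,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
```

If np arrives as a string from argparse, str(np) is harmless; if the whole command were passed as one string with shell=False, the OS would look for an executable literally named "mpirun -np ...", which would also produce a failure.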
Note that if I don't submit the script as a batch job I am able to run the subprocess; this only happens with sbatch. If I first allocate resources with salloc and then run the job interactively, I don't run into this issue either.
I'm not 100% sure, but it might be that the spawned subprocess doesn't receive the SLURM environment variables properly, so it doesn't know which nodes to parallelize over.
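To check that hypothesis, here is a small debug snippet I can drop into script.py that prints which SLURM_* variables the child process actually sees (subprocess.run is supposed to pass the parent's environment through by default when env= is not given):

```python
import subprocess

# Print every SLURM_* variable as seen from inside a spawned shell;
# "|| true" keeps the exit status zero even if grep finds nothing.
result = subprocess.run(
    "env | grep ^SLURM_ || true",
    shell=True,
    stdout=subprocess.PIPE,
    universal_newlines=True,
)
print(result.stdout)
```

If this prints nothing inside the batch job, the environment really is being lost somewhere between sbatch and the subprocess; if SLURM_NTASKS and friends show up, the problem is elsewhere.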
Any hint how to fix that?