Calling an MPI subprocess from within a Python script run as a SLURM job

I am having trouble with a SLURM job that launches an mpirun subprocess from a Python script. Inside the Python script (let's call it script.py) I have this subprocess.run call:

    import subprocess

    def run_mpi(config_name, np, working_dir):
        # Resulting command, e.g.: mpirun -np 32 spk_mpi -echo log < /$PATH/in.potts
        data_path = working_dir + "/" + config_name
        cmd = (
            "mpirun -np " + str(np) + " "
            + working_dir + "/spk_mpi -echo log < "
            + data_path + "/in.potts"
        )
        return subprocess.run(
            cmd,
            check=True,
            shell=True,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )

I then execute the script by submitting a SLURM batch job to a cluster node with something like:

#!/bin/bash
#SBATCH --job-name=myjob            
#SBATCH --nodes=1
#SBATCH --ntasks=32
#SBATCH --time=2-00:00:00               # Time limit days-hrs:min:sec
#SBATCH --partition=thin

python script.py --working_dir=$PATH --np=$SLURM_NTASKS
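
For context, script.py parses those flags with argparse roughly along these lines (the --config_name flag and its default shown here are only placeholders I added for illustration; it is not on the sbatch line above). Note that argparse returns the values as strings, which is why np can be concatenated directly into the mpirun command:

    import argparse

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--working_dir", required=True)
        parser.add_argument("--np", default="1")                    # arrives as a string
        parser.add_argument("--config_name", default="potts_run")   # placeholder name
        args = parser.parse_args()

        run_mpi(args.config_name, args.np, args.working_dir)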

but somehow the subprocess is never executed. I also tried changing the call to use shell=False, but then I get "returned non-zero exit status 1" (I might be doing something wrong when splitting the arguments).
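
For reference, a shell=False version of the same call would look roughly like the sketch below. Two things have to change: the command must be split into a list of arguments, and the `< .../in.potts` redirection must become an explicit stdin file, because `<` is interpreted by the shell, not by mpirun:

    def run_mpi_no_shell(config_name, np, working_dir):
        data_path = working_dir + "/" + config_name
        # Open the input deck ourselves, since "<" redirection needs a shell
        with open(data_path + "/in.potts") as infile:
            return subprocess.run(
                ["mpirun", "-np", str(np), working_dir + "/spk_mpi", "-echo", "log"],
                stdin=infile,
                check=True,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
                universal_newlines=True,
            )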

Note that if I don't submit the script as a batch job, I am able to execute the subprocess without problems; this only happens with the batch job. If I first allocate resources with salloc and then run the job interactively, I don't run into the issue either.

I'm not 100% sure, but it might be that the spawned subprocess doesn't get the SLURM configuration variables passed on properly, so mpirun doesn't know which nodes or cores to parallelize over.
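
One way to check this is to print the SLURM-related environment variables that the Python process sees just before spawning mpirun; by default subprocess.run passes the parent's environment on to the child, so whatever is listed here is what mpirun would get:

    import os

    # List the SLURM variables visible to this Python process;
    # the mpirun subprocess inherits exactly this environment.
    for key in sorted(os.environ):
        if key.startswith("SLURM_"):
            print(key, "=", os.environ[key])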

Any hint on how to fix this?
