Simplify your life with SLURM and sync

For my first blog post of the year, we’re talking about SLURM, everyone’s favorite job manager. If, like me, you have the joy of running a literal boatload of jobs with all kinds of parameters and command-line arguments, you’ll know there are a few tips and tricks that make managing these tasks and their results as painless as possible. I expect most people reading this will already be aware of these tricks, but for those who aren’t, I hope this is helpful. After all, it’s impossible to know what you don’t know you need to know, you know? Any alternatives, improvements, or suggestions are welcome!

Array Jobs

Job arrays are perfect for the times you want to run the same job several times with slight differences each time. Imagine you need to repeat a job 10 times with slightly different arguments for each run. Rather than submitting 10 (slightly different) batch scripts, you can submit 1 script with all the information needed to complete all 10 jobs.

To do this, pass the --array option either when submitting the job:

sbatch --array=0-9%2 example_slurm_script.sh

or in the slurm script itself:

#SBATCH --array=0-9%2

In the examples above the slurm script will run 10 jobs (0 to 9) but will only ever run 2 at a time (denoted by the %2) so you don’t hog all the resources. Besides a range of jobs to run, you can also provide a list of indexes. This is useful if you only need to run a selection of jobs.

sbatch --array=0,2,4,6%2 example_slurm_script.sh

Each time you submit an array, each job in it gets a different task ID (SLURM_ARRAY_TASK_ID). This value is unique to each job within the array, unlike the array job ID (SLURM_ARRAY_JOB_ID), which is the same for all of them. Since each job has its own SLURM_ARRAY_TASK_ID, you can use it in your slurm script to look up the values you wish to change each time.
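To see the two IDs side by side, here is a minimal sketch of a batch script that prints them. The job name and output file name are hypothetical; the %A and %a patterns are expanded by sbatch to the array job ID and the task ID, so each task writes to its own log. The fallback values are my own addition so the script can also be tried outside SLURM.

```shell
#!/bin/bash
#SBATCH --job-name=array_demo            # hypothetical job name
#SBATCH --array=0-9%2
#SBATCH --output=array_demo_%A_%a.out    # %A = array job ID, %a = array task ID

# Report which task of which array job this is.
# The fallbacks ("local", 0) are assumptions so the script also runs outside SLURM.
describe_task() {
    local job_id="${SLURM_ARRAY_JOB_ID:-local}"
    local task_id="${SLURM_ARRAY_TASK_ID:-0}"
    echo "array job ${job_id}, task ${task_id}"
}

describe_task
```

Every task prints the same array job ID but its own task ID. The parameter lookup below relies on exactly that task ID.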

# Define a dict (associative array) of parameters; you could also use a list instead.
# Keys run from 0 to 3, so submit this script with --array=0-3.
declare -A dict_1=([0]=10 [1]=20 [2]=30 [3]=40)
declare -A dict_2=([0]=2 [1]=4 [2]=6 [3]=8)

num_1="${dict_1[${SLURM_ARRAY_TASK_ID}]}"  # get the number whose key equals the task ID
num_2="${dict_2[${SLURM_ARRAY_TASK_ID}]}"  # get the number whose key equals the task ID

python add_numbers.py "$num_1" "$num_2"  # script adds the two numbers together and prints the result

In this example, the num_1 and num_2 arguments are looked up in the initial dictionaries and change with the task ID. This is a very simple example, but hopefully you get the idea and it will save you some time moving forward.
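As mentioned in the code comment, a list works just as well as a dict. Below is a rough sketch of that variant using plain bash indexed arrays, which are zero-based and so line up with --array=0-3. The get_params helper and the range check are my own additions for illustration: they make a task fail loudly if its ID has no matching entry, rather than passing empty arguments to the Python script.

```shell
#!/bin/bash
# Indexed arrays ("lists") holding the same values as the dicts above.
nums_1=(10 20 30 40)
nums_2=(2 4 6 8)

# Print the two parameters for a given task ID, or fail if none are defined.
# get_params is a hypothetical helper name, not part of SLURM.
get_params() {
    local task_id="$1"
    if (( task_id < 0 || task_id >= ${#nums_1[@]} )); then
        echo "No parameters defined for task ${task_id}" >&2
        return 1
    fi
    echo "${nums_1[task_id]} ${nums_2[task_id]}"
}

# Default to task 0 outside SLURM (an assumption, for local testing only).
params="$(get_params "${SLURM_ARRAY_TASK_ID:-0}")" || exit 1
read -r num_1 num_2 <<< "$params"

echo "Would run: python add_numbers.py ${num_1} ${num_2}"
```

In a real job, the final echo would be replaced by the python call itself.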

RSYNC

While technically not SLURM related, one tool I discovered recently that has made my life a little easier is rsync. Rsync (remote sync) is a remote/local file synchronization tool that lets you keep a remote and a local directory up to date by only sending or receiving files that have changed. In almost all cases some kind of version control is the best way to keep your files in check; however, rsync has come in clutch.

A really simple example use-case is below: you have updated a remote copy of a file and would like to sync it with pegasus. Here user and host are placeholders for your own login and the machine's address.

rsync -a /remote/file/path.py user@host:/local/file/path.py
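If you want to get a feel for rsync before pointing it at real files, the sketch below runs it between two throwaway local directories. These are my own stand-ins for a local and a remote location, and the file name is hypothetical. It uses -v to list what is transferred and -n (dry run) to preview a sync without copying anything.

```shell
#!/bin/bash
# Build two throwaway directories so the example can be run anywhere.
src="$(mktemp -d)"
dst="$(mktemp -d)"
echo "print('hello')" > "${src}/script.py"

# -a preserves permissions and timestamps, -v lists what is transferred,
# -n (--dry-run) only reports what would happen without copying anything.
rsync -avn "${src}/" "${dst}/"

# The same command without -n performs the copy. The trailing slash on the
# source means "the contents of src", not the directory itself.
rsync -av "${src}/" "${dst}/"

# A second run has nothing to transfer because the file is unchanged.
rsync -av "${src}/" "${dst}/"
```

For a remote destination, the second path takes the user@host:/path form used in the command above.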

Rsync has a lot more to offer. I found the linked article very helpful when getting started: [https://www.digitalocean.com/community/tutorials/how-to-use-rsync-to-sync-local-and-remote-directories]
