Hi all,
We have a cluster of 4 nodes and submit jobs via SLURM. Occasionally, two jobs that each require >50% of the scratch space get scheduled on the same node, and the second job then has to wait for hours until scratch becomes available. My two workarounds so far are to keep resubmitting the waiting job until it lands on a different node, or to set up an additional lane for each node "for emergency use only", but there must be a better way. Is it possible to expose SLURM's --nodelist option so it can be passed along with the job in these cases, or is there some other solution? A sketch of what I have in mind is below.
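To make that concrete, what I'm imagining is roughly the following sbatch script, where a job can be pinned to (or steered away from) a particular node at submission time. The node names are just placeholders for our 4 nodes; --nodelist and --exclude are standard sbatch options.

```bash
#!/usr/bin/env bash
#SBATCH --job-name=scratch_heavy_job
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --nodelist=node03      # pin this job to node03 (placeholder node name)
##SBATCH --exclude=node01      # alternatively: run anywhere except node01

# Placeholder for the actual job command
srun hostname
```

The question is whether something like this can be passed through per job at submission time, rather than baked into separate per-node lanes.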
Thanks for any advice.
Kevin