[slurm-users] Salloc expand feature

Hi,

I have a question about a silent feature removal. It concerns the --dependency=expand feature, which was present in Slurm for 10 years until its removal in version 21.08.03.

Until Slurm 21.08.02, the expand option was extensively documented alongside the dynamic job elasticity features, which covered both shrinking and expanding a job's resources. You can see this in the archived FAQ: slurm.schedmd.com/archive/slurm-21.08.2/faq.html#job_size

Note that these features were added in Slurm 2.3 back in 2011, so they were supported for nearly 10 years. For instance, see Slide 7 in slurm.schedmd.com/slurm_ug_2011/SLURM.v23.status.pdf

However, the expand option was silently removed from the Slurm documentation on October 22nd, 2021, a few weeks before the release of Slurm 21.08.03: github.com/SchedMD/slurm/commit/11ce912f31519799494fde3140f530cfc8cfff6a

There was no announcement explaining why the feature was removed. The release notes for Slurm 21.08.03, which came out in November 2021, do not mention it: lists.schedmd.com/pipermail/slurm-announce/2021/000066.html

Today, one can still dynamically “shrink” a job though: slurm.schedmd.com/faq.html#job_size

My question is: why was the feature removed? What were the conceptual and technical issues that led to dropping support for it?

I can understand why properly expanding a job may be tricky while shrinking it is not, especially when other queued jobs may be waiting. However, jobs having to wait more, or less, is a well-known expectation on HPC clusters. I thought a clearer explanation of why the feature was removed would be worth learning about.

Thank you,

/Abel
