[slurm-users] Slurm version 23.02.1 is now available

We are pleased to announce the availability of Slurm version 23.02.1.

This release includes several significant fixes to the upgrade
process, among them a fix for remote licenses allowed percentages
being reset to 0 during the upgrade, and fixes for a few issues seen
during rolling upgrades.

Slurm can be downloaded from www.schedmd.com/downloads.php.

- Marshall


Marshall Garey
Release Management, Support, and Development
SchedMD LLC - Commercial Slurm Development and Support

> * Changes in Slurm 23.02.1
> ==========================
> -- job_container/tmpfs - cleanup job container even if namespace mount is
> already unmounted.
> -- When cluster specific tables are removed, also remove the job_env_table
> and job_script_table.
> -- Fix the way bf_max_job_test is applied to job arrays in backfill.
> -- data_parser/v0.0.39 - Avoid dumping -1 value or NULL when step's
> consumed_energy is unset.
> -- scontrol - Fix showing Array Job Steps.
> -- scontrol - Fix showing Job HetStep.
> -- openapi/dbv0.0.38 - Fix not displaying an error when updating QOS or
> associations fails.
> -- data_parser/v0.0.39 - Avoid crash while parsing composite structures.
> -- sched/backfill - fix deleted planned node staying in planned node bitmap.
> -- Fix nodes remaining as PLANNED after slurmctld save state recovery.
> -- Fix parsing of cgroup.controllers file with a blank line at the end.
> -- Add cgroup.conf EnableControllers option for cgroup/v2 (example below).
> -- Get correct cgroup root to allow slurmd to run in containers like Docker.
> -- Fix "(null)" cluster name in SLURM_WORKING_CLUSTER env.
> -- slurmctld - add missing PrivateData=jobs check to step ContainerID lookup
> requests originating from 'scontrol show step container-id=<id>' or certain
> scrun operations when container state can't be directly queried.
> -- Automatically sort the TaskPlugin list reverse-alphabetically. This
> addresses an issue where cpu masks were reset if task/affinity was listed
> before task/cgroup on cgroup/v2 systems with Linux kernel < 6.2 (example
> below).
> -- Fix some failed terminate job requests from a 23.02 slurmctld to a 22.05 or
> 21.08 slurmd.
> -- Fix compile issues on 32-bit systems.
> -- Fix nodes un-draining after being drained due to unkillable step.
> -- Fix remote licenses allowed percentages reset to 0 during upgrade.
> -- sacct - Avoid truncating time strings when using SLURM_TIME_FORMAT with
> the --parsable option.
> -- data_parser/v0.0.39 - fix segfault when default qos is not set.
> -- Fix regression in 22.05.0rc1 that broke Nodes=ALL in a NodeSet.
> -- openapi/v0.0.39 - fix jobs submitted via slurmrestd being allocated fewer
> CPUs than tasks when requesting multiple tasks.
> -- Fix job not being scheduled on valid nodes and potentially being rejected
> when using parentheses at the beginning of square brackets in a feature
> request, for example: "feat1&[(feat2|feat3)]" (examples below).
> -- Fix a job being scheduled on nodes that do not match a feature request that
> uses parentheses inside of brackets and requests additional features outside
> of brackets, for example: "feat1&[feat2|(feat3|feat4)]".
> -- Fix regression in 23.02.0rc1 which made --gres-flags=enforce-binding no
> longer enforce optimal core-gpu job placement.
> -- switch/hpe_slingshot - add option to disable VNI allocation per-job.
> -- switch/hpe_slingshot - restrict CXI services to the requesting user.
> -- switch/hpe_slingshot - Only output tcs once in SLINGSHOT_TCS env.
> -- switch/hpe_slingshot - Fix updating LEs and ACs limits.
> -- switch/hpe_slingshot - Use correct Max for EQs and CTs.
> -- switch/hpe_slingshot - support configuring network options per-job.
> -- switch/hpe_slingshot - retry destroying CXI service if necessary.
> -- Fix memory leak caused by job preemption when licenses are configured.
> -- mpi/pmix - Fix v5 to load correctly when libpmix.so isn't in the normal
> lib path.
> -- data_parser/v0.0.39 - fix regression where "memory_per_node" would be
> rejected for job submission.
> -- data_parser/v0.0.39 - fix regression where "memory_per_cpu" would be
> rejected for job submission.
> -- slurmctld - add an assert to check for magic number presence before deleting
> a partition record and clear the magic afterwards to better diagnose
> potential memory problems.
> -- Clean up OCI containers task directories correctly.
> -- slurm.spec - add "--with jwt" option (example below).
> -- scrun - Run under existing job when SLURM_JOB_ID is present (example below).
> -- Prevent a slurmstepd crash when the I/O subsystem has hung.
> -- common/conmgr - fix memory leak of complete connection list.
> -- data_parser/v0.0.39 - fix memory leak when parsing every field in a struct.
> -- job_container/tmpfs - avoid printing extraneous error messages when running
> a spank plugin that implements slurm_spank_job_prolog() or
> slurm_spank_job_epilog().
> -- Fix srun < 23.02 always getting an "exact" core allocation.
> -- Prevent scontrol < 23.02 from setting MaxCPUsPerSocket to 0.
> -- Add ScronParameters=explicit_scancel and corresponding scancel --cron
> option (example below).
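
A few of the configuration-related changes above are easier to see
with short examples. These are sketches based only on the notes above;
double-check exact syntax against the cgroup.conf(5), slurm.conf(5),
and scancel(1) man pages shipped with 23.02.

For the new EnableControllers option (cgroup/v2 only), a minimal
cgroup.conf sketch:

    # cgroup.conf - minimal cgroup/v2 sketch
    CgroupPlugin=cgroup/v2
    # Have slurmd enable the available controllers (cpu, cpuset,
    # memory, ...) in cgroup.subtree_control down to its own level,
    # useful where the init system has not delegated them.
    EnableControllers=yes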
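
For the TaskPlugin ordering change, the two spellings below now behave
identically, since 23.02.1 sorts the list reverse-alphabetically so
that task/cgroup is applied before task/affinity:

    # slurm.conf - either line now yields the same effective order,
    # avoiding the cpu mask reset seen on cgroup/v2 systems with
    # Linux kernels older than 6.2
    TaskPlugin=task/affinity,task/cgroup
    #TaskPlugin=task/cgroup,task/affinity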
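
The two feature-expression fixes can be exercised with the constraint
strings quoted in the entries above, e.g.:

    # Previously could be rejected or miss valid nodes:
    sbatch --constraint="feat1&[(feat2|feat3)]" job.sh
    # Previously could land on nodes not matching the request:
    sbatch --constraint="feat1&[feat2|(feat3|feat4)]" job.sh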
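
The new slurm.spec conditional is used like the existing rpmbuild
"--with" options, for example:

    # Build the Slurm RPMs with JWT authentication support
    rpmbuild -ta slurm-23.02.1.tar.bz2 --with jwt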
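
For the scrun change: scrun is normally invoked by a container engine
as its OCI runtime, and SLURM_JOB_ID is exported automatically inside
an allocation, so a container launched from within one now runs under
that existing job rather than requesting a new allocation. Roughly,
assuming rootless podman has already been configured to use scrun as
its OCI runtime per the Slurm containers documentation:

    # SLURM_JOB_ID is inherited from the allocation's shell, so
    # the container started through scrun reuses this job:
    salloc -N1
    podman run --rm alpine /bin/true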
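
Finally, the new explicit_scancel flag pairs with the new scancel
option; a sketch, assuming the usual comma-separated form for
ScronParameters:

    # slurm.conf
    ScronParameters=enable,explicit_scancel

    # With explicit_scancel set, an scrontab-managed job is only
    # cancelled when scancel is invoked with the new flag:
    scancel --cron <jobid>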
