This release includes several significant fixes to the upgrade process,
including the allowed percentages of remote licenses being reset to 0
during the upgrade, as well as fixes for a few issues with rolling upgrades.
Slurm can be downloaded from www.schedmd.com/downloads.php.
- Marshall

--
Marshall Garey
Release Management, Support, and Development
SchedMD LLC – Commercial Slurm Development and Support
> * Changes in Slurm 23.02.1
> ==========================
> -- job_container/tmpfs - cleanup job container even if namespace mount is
> already unmounted.
> -- When cluster specific tables are removed, also remove the job_env_table
> and job_script_table.
> -- Fix the way bf_max_job_test is applied to job arrays in backfill.
> -- data_parser/v0.0.39 - Avoid dumping -1 value or NULL when step's
> consumed_energy is unset.
> -- scontrol - Fix showing Array Job Steps.
> -- scontrol - Fix showing Job HetStep.
> -- openapi/dbv0.0.38 - Fix not displaying an error when updating QOS or
> associations fails.
> -- data_parser/v0.0.39 - Avoid crash while parsing composite structures.
> -- sched/backfill - fix deleted planned node staying in planned node bitmap.
> -- Fix nodes remaining as PLANNED after slurmctld save state recovery.
> -- Fix parsing of cgroup.controllers file with a blank line at the end.
> -- Add cgroup.conf EnableControllers option for cgroup/v2 (see the example
> after this changelog).
> -- Get correct cgroup root to allow slurmd to run in containers like Docker.
> -- Fix "(null)" cluster name in SLURM_WORKING_CLUSTER env.
> -- slurmctld - add missing PrivateData=jobs check to step ContainerID lookup
> requests originating from 'scontrol show step container-id=<id>' or certain
> scrun operations when container state can't be directly queried.
> -- Automatically sort the TaskPlugin list reverse-alphabetically. This
> addresses an issue where cpu masks were reset if task/affinity was listed
> before task/cgroup on cgroup/v2 systems with Linux kernel < 6.2 (see the
> example after this changelog).
> -- Fix some failed terminate job requests from a 23.02 slurmctld to a 22.05 or
> 21.08 slurmd.
> -- Fix compile issues on 32-bit systems.
> -- Fix nodes un-draining after being drained due to unkillable step.
> -- Fix remote licenses allowed percentages reset to 0 during upgrade.
> -- sacct - Avoid truncating time strings when using SLURM_TIME_FORMAT with
> the --parsable option.
> -- data_parser/v0.0.39 - fix segfault when default qos is not set.
> -- Fix regression in 22.05.0rc1 that broke Nodes=ALL in a NodeSet.
> -- openapi/v0.0.39 - fix jobs submitted via slurmrestd being allocated fewer
> CPUs than tasks when requesting multiple tasks.
> -- Fix job not being scheduled on valid nodes and potentially being rejected
> when using parentheses at the beginning of square brackets in a feature
> request, for example: "feat1&[(feat2|feat3)]".
> -- Fix a job being scheduled on nodes that do not match a feature request that
> uses parentheses inside of brackets and requests additional features outside
> of brackets, for example: "feat1&[feat2|(feat3|feat4)]" (see the example
> after this changelog).
> -- Fix regression in 23.02.0rc1 which made --gres-flags=enforce-binding no
> longer enforce optimal core-gpu job placement.
> -- switch/hpe_slingshot - add option to disable VNI allocation per-job.
> -- switch/hpe_slingshot - restrict CXI services to the requesting user.
> -- switch/hpe_slingshot - Only output tcs once in SLINGSHOT_TCS env.
> -- switch/hpe_slingshot - Fix updating LEs and ACs limits.
> -- switch/hpe_slingshot - Use correct Max for EQs and CTs.
> -- switch/hpe_slingshot - support configuring network options per-job.
> -- switch/hpe_slingshot - retry destroying CXI service if necessary.
> -- Fix memory leak caused by job preemption when licenses are configured.
> -- mpi/pmix - Fix v5 to load correctly when libpmix.so isn't in the normal
> lib path.
> -- data_parser/v0.0.39 - fix regression where "memory_per_node" would be
> rejected for job submission.
> -- data_parser/v0.0.39 - fix regression where "memory_per_cpu" would be
> rejected for job submission.
> -- slurmctld - add an assert to check for magic number presence before deleting
> a partition record and clear the magic afterwards to better diagnose
> potential memory problems.
> -- Clean up OCI containers task directories correctly.
> -- slurm.spec - add "--with jwt" option (see the example after this
> changelog).
> -- scrun - Run under existing job when SLURM_JOB_ID is present.
> -- Prevent a slurmstepd crash when the I/O subsystem has hung.
> -- common/conmgr - fix memory leak of complete connection list.
> -- data_parser/v0.0.39 - fix memory leak when parsing every field in a struct.
> -- job_container/tmpfs - avoid printing extraneous error messages when running
> a spank plugin that implements slurm_spank_job_prolog() or
> slurm_spank_job_epilog().
> -- Fix srun < 23.02 always getting an "exact" core allocation.
> -- Prevent scontrol < 23.02 from setting MaxCPUsPerSocket to 0.
> -- Add ScronParameters=explicit_scancel and corresponding scancel --cron
> option (see the example after this changelog).
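
A few of the entries above introduce new configuration or command-line
syntax; the sketches below illustrate them. For the new cgroup.conf
EnableControllers option, a minimal sketch, assuming a cgroup/v2 system and
that the option takes a yes/no value (the CgroupPlugin line is illustrative
and may already be set or autodetected on your system):

    # cgroup.conf
    CgroupPlugin=cgroup/v2
    # Have slurmd enable the controllers it needs in cgroup.subtree_control
    # rather than relying on distribution defaults:
    EnableControllers=yes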
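
For the TaskPlugin sorting change, an illustrative slurm.conf line; as of
23.02.1, either ordering behaves the same, since the list is sorted so that
task/cgroup is applied before task/affinity:

    # slurm.conf
    # Sorted internally to task/cgroup,task/affinity as of 23.02.1:
    TaskPlugin=task/affinity,task/cgroup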
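
For the two feature-request fixes, expressions of that form are passed via
--constraint; a hypothetical submission (job.sh is a placeholder script):

    # Every node must provide feat1, plus one of feat2 or feat3, with the
    # same bracketed alternative chosen for the whole allocation:
    $ sbatch --constraint="feat1&[(feat2|feat3)]" job.sh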
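
The new slurm.spec option plugs into rpmbuild's conditional-build syntax; a
sketch, assuming an RPM-based system and a build from the release tarball
(filename illustrative):

    $ rpmbuild -ta slurm-23.02.1.tar.bz2 --with jwt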
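
For the explicit_scancel addition, a sketch of how the two pieces might fit
together, assuming the intent is that scrontab-managed jobs are cancelled
only when the cancellation is explicitly requested (the job ID is a
placeholder):

    # slurm.conf
    ScronParameters=explicit_scancel

    # Explicitly cancel a scrontab-managed job:
    $ scancel --cron 12345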