Overview
This document provides a detailed overview of the updates and improvements made during the recent maintenance period. These changes are aimed at enhancing the system's reliability, performance, and user experience. Below is a breakdown of each major improvement.
Notable Changes
System Software and Security Upgrades
The cluster's operating system was upgraded from Rocky Linux 8.9 to 8.10. This upgrade includes improved security features, performance optimizations, and support for newer libraries and tools. These updates enhance system stability and compatibility with modern software requirements.
Slurm Scheduler Upgrade
The Slurm scheduler was upgraded from version 23.11.5 to 24.05.0. This update brings better job scheduling algorithms and improved resource management, and makes features introduced in the newer release available. It also resolves several bugs, improving the overall user experience.
Mamba and Jupyter Environment Updates
The Mamba package manager was updated from version 1.5.1 to 1.5.9, alongside updates to the Jupyter environments. These updates improve compatibility with newer Python libraries and address performance and stability issues.
If you need to use the older Mamba environment, you can load it with the command
    module load mamba/.1.5.1
instead of
    module load mamba/latest
High-Availability Networking Repairs
Critical repairs were completed on the high-availability networking infrastructure to address reliability issues. These changes ensure a more robust and fault-tolerant network, reducing the risk of disruptions and improving overall connectivity for compute nodes and services.
Improved Zsh Compatibility
Updates were made to improve compatibility with the Zsh shell:
Bash functions were migrated to standalone bash scripts, ensuring they work as expected regardless of the shell being used.
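To illustrate why this helps, here is a minimal sketch of a standalone script. The name greet and its contents are hypothetical and are not one of the migrated cluster functions. A function defined in a Bash startup file exists only inside Bash sessions, whereas an executable file with a bash shebang is run by bash no matter which shell launches it.

```shell
#!/usr/bin/env bash
# Hypothetical example: save as an executable file named "greet" somewhere on PATH.
# The shebang above makes the system run this file with bash, even when it is
# invoked from zsh or another shell.

greet() {
    # Use the first argument if given, otherwise fall back to the current user.
    printf 'Hello, %s\n' "${1:-${USER:-unknown}}"
}

greet "$@"
```

Once the file is executable, typing greet alice behaves the same from a Bash or a Zsh prompt.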
Rebuilt OpenMPI for Broader Application Support
OpenMPI was rebuilt to expand compatibility and resolve prior issues:
Previously, OpenMPI was linked against compilers optimized for AVX512 instructions, causing silent failures on nodes lacking AVX512 support.
The new version (4.1.7) is available via
    module load openmpi/4.1.7
The older version remains accessible via
    module load openmpi/4.1.5
Users are encouraged to try the new module, as it will become the default in the future. However, the older module will remain available for now.
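To check whether a particular node is affected by the AVX512 issue, the CPU flags can be inspected directly. The sketch below assumes a Linux node where /proc/cpuinfo lists CPU features, and it treats the avx512f (AVX-512 Foundation) flag as the indicator. The has_avx512 helper name is hypothetical.

```shell
# Hypothetical helper: succeed if the CPU flags include AVX-512 Foundation.
# $1 is an optional cpuinfo-style file; it defaults to /proc/cpuinfo.
has_avx512() {
    grep -qw 'avx512f' "${1:-/proc/cpuinfo}"
}

if has_avx512; then
    echo "This node reports AVX512 support."
else
    echo "This node does not report AVX512 support."
fi
```

Run it inside a job (for example through srun) to inspect a compute node rather than the login node.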
Other Notable Changes
The thisjob script has been enhanced to automatically check $SLURM_JOB_ID if no job ID is provided.
Automated node health checks have been revised to include additional checks.
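The thisjob fallback to $SLURM_JOB_ID can be sketched as follows. This illustrates the general pattern only and is not the actual thisjob source. The resolve_job_id function name is hypothetical.

```shell
# Hypothetical helper: print the job ID to operate on.
# An explicit first argument wins; otherwise fall back to $SLURM_JOB_ID.
resolve_job_id() {
    job_id="${1:-${SLURM_JOB_ID:-}}"
    if [ -z "$job_id" ]; then
        echo "error: no job ID given and SLURM_JOB_ID is not set" >&2
        return 1
    fi
    printf '%s\n' "$job_id"
}
```

Slurm sets SLURM_JOB_ID inside a running job, so calling the helper there with no argument yields the current job's ID. Outside a job, an ID must be passed explicitly.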
If you would like more information about these changes, or find that they are negatively affecting your work, please reach out to us.
We also offer a series of Educational Opportunities and Workshops.