December 2024 Maintenance Details

Overview

This document describes the updates made during the December 2024 maintenance period. The changes are intended to improve the system's reliability, performance, and user experience. Each major change is described below.

Notable Changes

Phoenix System Software and Security Upgrades

 

The cluster's operating system was upgraded from Rocky Linux 8.9 to 8.10. This upgrade includes improved security features, performance optimizations, and support for newer libraries and tools. These updates enhance system stability and compatibility with modern software requirements.


Phoenix Slurm Scheduler Upgrade


The Slurm scheduler was upgraded from version 23.11.5 to 24.05.0. This update brings better job scheduling algorithms, improved resource management, and compatibility with newer Slurm features. The new version also resolves several bugs, improving the overall user experience.


Phoenix Mamba and Jupyter Environment Updates


The Mamba package manager was updated from version 1.5.1 to 1.5.9, alongside updates to the Jupyter environments. These updates improve compatibility with newer Python libraries and address performance and stability issues.

  • If you need the older Mamba environment, load it with module load mamba/.1.5.1 instead of module load mamba/latest.


High-Availability Networking Repairs


Critical repairs were completed on the high-availability networking infrastructure to address reliability issues. These changes ensure a more robust and fault-tolerant network, reducing the risk of disruptions and improving overall connectivity for compute nodes and services.


Improved Zsh Compatibility


Updates were made to improve compatibility with the Zsh shell:

  • Bash functions were migrated to standalone Bash scripts, so they work as expected regardless of the shell in use.
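
As an illustration of why a standalone script is shell-independent, here is a minimal sketch. The script name show-shell and the temporary directory are hypothetical and are not part of the cluster's tooling. Because the script carries its own #!/bin/bash line, it runs under Bash whether it is invoked from Bash, Zsh, or another shell, whereas a function defined in a Bash startup file exists only inside Bash sessions.

```shell
#!/bin/bash
# Hypothetical example: a helper written as a standalone script rather than
# as a Bash function defined in a startup file.
demo_dir=$(mktemp -d)

cat > "$demo_dir/show-shell" <<'EOF'
#!/bin/bash
# The shebang above selects Bash as the interpreter, whatever shell calls us.
echo "interpreter: ${BASH##*/}"
EOF
chmod +x "$demo_dir/show-shell"

# Invoking the script by path works the same way from any login shell.
"$demo_dir/show-shell"
```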


Phoenix Rebuild of OpenMPI for Broader Application Support


OpenMPI was rebuilt to expand compatibility and resolve prior issues:

  • Previously, OpenMPI was linked against compilers optimized for AVX512 instructions, causing silent failures on nodes lacking AVX512 support.

  • The new version (4.1.7) is available via module load openmpi/4.1.7.

  • The older version remains accessible via module load openmpi/4.1.5.

  • Users are encouraged to try the new module, as it will become the default in the future. However, the older module will remain available for now.
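
To find out whether a given node lacks AVX512 support, one option is to inspect the CPU flags reported by the Linux kernel. The sketch below is an illustration under stated assumptions, not a tool provided on the cluster: the function name has_avx512 is hypothetical, and it treats the presence of the avx512f flag in /proc/cpuinfo as a proxy for AVX512 support.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the given cpuinfo-style file lists the
# AVX512 foundation flag (avx512f). Defaults to /proc/cpuinfo on Linux.
has_avx512() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    grep -qw 'avx512f' "$cpuinfo" 2>/dev/null
}

if has_avx512; then
    echo "This node reports AVX512 support."
else
    echo "This node does not report AVX512 support."
fi
```

Running this inside a job, before module load openmpi/4.1.7 or module load openmpi/4.1.5, shows which kind of node the job landed on.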


Other Notable Changes

 

  • The thisjob script now falls back to $SLURM_JOB_ID when no job ID is provided.

  • Bash-completion support was added for interactive and other Slurm commands.

  • Automated node health checks have been revised.
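
The fallback described for thisjob is a common shell pattern. The sketch below shows one way it could be written; it is not the actual thisjob source, and the function name resolve_job_id is hypothetical. It prefers an explicit argument and otherwise uses $SLURM_JOB_ID, which Slurm sets inside a running job.

```shell
#!/bin/bash
# Hypothetical sketch of the fallback: use the first argument if given,
# otherwise fall back to SLURM_JOB_ID from the job's environment.
resolve_job_id() {
    local job_id="${1:-${SLURM_JOB_ID:-}}"
    if [ -z "$job_id" ]; then
        echo "error: no job ID given and SLURM_JOB_ID is not set" >&2
        return 1
    fi
    printf '%s\n' "$job_id"
}

# Inside a job, "resolve_job_id" prints the value of $SLURM_JOB_ID.
# With an argument, "resolve_job_id 12345" prints 12345.
```

Outside a job, where $SLURM_JOB_ID is unset, calling the function without an argument reports an error instead of guessing.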

 

If you would like more information about these changes, or find that they are negatively impacting your work, please feel free to reach out.

We also offer a series of Educational Opportunities and Workshops.