Fall Maintenance 2024 Information
General Updates
Rocky OS security updates
Slurm updated from 24.05.1 to 24.05.3 for security updates
Turned on per-node energy use tracking (see the example sacct query after the salloc output below)
An additional 16 GPU MIG instances are available
Web portal upgraded from 3.0.3 to 3.1.7
Globus functionality in the portal allows easy access to the globus.org file manager.
Update to JupyterLab
Update to the Mamba Python environment manager
Modernized job_submit plugin
Interactive/salloc-submitted jobs will now announce default values applicable to your job when they are omitted.
$ salloc -t 240
salloc: QOS not specified; assigning "public" qos
salloc: cpus-per-task not specified; assigning 1 core
salloc: time_limit <= 240 and Partition not specified; assigning "htc" partition
salloc: Pending job allocation 19824107
salloc: job 19824107 queued and waiting for resources
$ salloc -p general -t 240 -q public -c 1
salloc: Pending job allocation 19824117
salloc: job 19824117 queued and waiting for resources
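With per-node energy tracking enabled, Slurm accounting now records energy counters for completed jobs. A minimal sketch of how to query them, assuming energy gathering was active on the nodes the job ran on; the job ID below is taken from the salloc example above and is purely illustrative.
$ # Report elapsed time, node list, and total energy consumed (in joules) for a finished job
$ sacct -j 19824107 --format=JobID,Elapsed,NodeList,ConsumedEnergy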
Technical Updates
Warewulf upgraded to 4.5.8-1
Grace Hopper firmware upgraded to 3.17.0
Dell PowerStore storage device firmware upgraded from 2.1.1.1 to 3.6.1.3
Firewall firmware updated
Added Arbiter to the soldtn node
Mamba updated to 1.5.10 (see the environment sketch after this list)
Jupyter Notebook updated to the latest release available at the time of maintenance
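For reference, a minimal sketch of using the updated Mamba to build a personal environment. The module name, Python version, and packages below are illustrative assumptions rather than a prescribed workflow.
$ # Load the Mamba module (name assumed; check "module avail" for the exact name)
$ module load mamba
$ # Create and activate a small test environment
$ mamba create -n demo-env python=3.11 numpy
$ source activate demo-env    # or "mamba activate demo-env" if your shell has been initialized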
MPI Performance Metric
We used the OSU Micro-Benchmarks (OMB) v7.4 from Ohio State University to check the status of nodes before and after maintenance. These tests measure bandwidth and latency on randomly selected, unique node pairs across all nodes, using all eight MPI modules on Sol. The goal is to ensure the system functions properly before the cluster is released at the end of maintenance. This workflow runs a large number of test jobs to verify the health of individual nodes, MPI modules, the Mamba module, and Slurm.
To view the resulting plots from before and after this maintenance: https://drive.google.com/drive/folders/1sifd2htIBrTLhjsLWXD2UDDkYxGfzZKo?usp=sharing
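For reference, a minimal sketch of one such pairwise test, assuming OMB v7.4 has already been built against the MPI module being checked and that the osu_bw/osu_latency binaries sit in the working directory; the module name is an assumption and differs for each of the eight MPI stacks.
$ # Load one MPI module under test, then run the point-to-point bandwidth and
$ # latency tests across a node pair, one rank per node
$ module load openmpi
$ srun -N 2 --ntasks-per-node=1 ./osu_bw
$ srun -N 2 --ntasks-per-node=1 ./osu_latency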