Fall Maintenance 2024 Information

General Updates

  • Rocky OS Security updates

  • Slurm version update

    • 24.05.1 to 24.05.3 for security updates

    • Enabled per-node energy usage tracking

  • 16 additional GPU MIG instances are available

  • Web portal upgraded from 3.0.3 to 3.1.7

    • Globus functionality in the portal allows easy access to the globus.org file manager.

    • Update to Jupyter Lab

  • Update to mamba python environment manager

  • Modernized job_submit plugin 

    • Interactive jobs submitted with salloc now announce the default values applied to your job when options are omitted.

$ salloc -t 240
salloc: QOS not specified; assigning "public" qos
salloc: cpus-per-task not specified; assigning 1 core
salloc: time_limit <= 240 and Partition not specified; assigning "htc" partition
salloc: Pending job allocation 19824107
salloc: job 19824107 queued and waiting for resources

$ salloc -p general -t 240 -q public -c 1
salloc: Pending job allocation 19824117
salloc: job 19824117 queued and waiting for resources
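
The defaulting behavior visible in the salloc output above can be modeled with a small Python sketch. This is only an illustration of the announced rules, not the plugin's actual code, which runs inside Slurm. The names `JobRequest`, `apply_defaults`, and `HTC_MAX_MINUTES` are hypothetical, and the sketch assumes nothing about what happens when the time limit exceeds 240 minutes and no partition is given, since the output above does not show that case.

```python
from dataclasses import dataclass
from typing import List, Optional

# Threshold taken from the salloc message above (minutes).
HTC_MAX_MINUTES = 240


@dataclass
class JobRequest:
    """Hypothetical stand-in for the options passed to salloc."""
    time_limit: int                      # -t, in minutes
    qos: Optional[str] = None            # -q
    cpus_per_task: Optional[int] = None  # -c
    partition: Optional[str] = None      # -p


def apply_defaults(job: JobRequest) -> List[str]:
    """Fill in omitted options and return the announcements, in order."""
    messages = []
    if job.qos is None:
        job.qos = "public"
        messages.append('QOS not specified; assigning "public" qos')
    if job.cpus_per_task is None:
        job.cpus_per_task = 1
        messages.append("cpus-per-task not specified; assigning 1 core")
    # Only the short-job case is documented above; longer jobs without a
    # partition are left untouched here because their handling is not shown.
    if job.partition is None and job.time_limit <= HTC_MAX_MINUTES:
        job.partition = "htc"
        messages.append(
            f"time_limit <= {HTC_MAX_MINUTES} and Partition not specified; "
            'assigning "htc" partition'
        )
    return messages


if __name__ == "__main__":
    job = JobRequest(time_limit=240)
    for line in apply_defaults(job):
        print(f"salloc: {line}")
```

A request that already sets every option, as in the second salloc command above, produces no announcements in this model.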

Technical Updates

  • Warewulf upgraded to 4.5.8-1

  • Grace Hopper firmware upgraded to 3.17.0

  • Dell Powerstore storage device firmware upgraded from 2.1.1.1 to 3.6.1.3

  • Firewall firmware updated

  • Added arbiter to soldtn node

  • Mamba updated to 1.5.10

  • Jupyter Notebook updated to the latest version available at the time of maintenance

MPI Performance Metric

We used the OSU Micro-Benchmarks (OMB) v7.4 from Ohio State University to check the status of nodes before and after maintenance. These tests measure bandwidth and latency on randomly selected, unique node pairs across all nodes, using all eight MPI modules on Sol. The goal is to ensure the system functions properly before the cluster is released at the end of maintenance. This workflow runs a large number of test jobs to verify the health of individual nodes, MPI modules, the Mamba module, and Slurm.
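
One way to picture the node-pair selection described above is the following Python sketch. It is a minimal illustration under stated assumptions, not the workflow actually used: it assumes "unique" means each node appears in at most one pair, the function name `random_unique_pairs` and the node names are hypothetical, and launching the OMB tests on each pair is out of scope.

```python
import random
from typing import List, Optional, Tuple


def random_unique_pairs(
    nodes: List[str], seed: Optional[int] = None
) -> List[Tuple[str, str]]:
    """Shuffle the node list and pair up neighbours.

    Each node appears in at most one pair. With an odd number of
    nodes, the last node after shuffling is left unpaired.
    """
    rng = random.Random(seed)   # a seed makes the selection reproducible
    shuffled = list(nodes)      # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    return [
        (shuffled[i], shuffled[i + 1])
        for i in range(0, len(shuffled) - 1, 2)
    ]


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    example_nodes = [f"node{i:03d}" for i in range(1, 9)]
    for a, b in random_unique_pairs(example_nodes):
        print(f"{a} <-> {b}")
```

Each resulting pair would then be handed to one bandwidth test and one latency test per MPI module, so a failing pair points at a specific node or module.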

To view the resulting plots from before and after this maintenance: https://drive.google.com/drive/folders/1sifd2htIBrTLhjsLWXD2UDDkYxGfzZKo?usp=sharing