Using Containers on the Grace Hopper GH100

The Grace Hopper node (sgh100) on the Sol supercomputer pairs an ARM-based (aarch64) processor, rather than x86_64, with an NVIDIA H100 GPU. This combination enables workloads that use both the node's large memory and the H100 GPU.

ARM processors do not run x86_64 software, so a separate set of software tools is provided for this hardware. Software must be compiled for aarch64 to run on this node.
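
Before building or running anything, it can help to confirm which architecture the current shell is on. The following is a minimal sketch, not a site-provided tool: the function name is_arm64 is hypothetical, and the only external command used is the standard uname -m.

```shell
#!/usr/bin/env bash
# Report whether the current host is a 64-bit ARM machine.

# is_arm64 ARCH: succeed for the two common spellings of 64-bit ARM.
# (Hypothetical helper name, for illustration only.)
is_arm64() {
    case "$1" in
        aarch64|arm64) return 0 ;;
        *)             return 1 ;;
    esac
}

# uname -m prints the machine hardware name, e.g. "aarch64" or "x86_64".
arch="$(uname -m)"

if is_arm64 "$arch"; then
    echo "host is $arch: native builds here target aarch64"
else
    echo "host is $arch: native builds here do not target aarch64"
fi
```

Run on a login node and again inside an allocation on the Grace Hopper node, the two messages should differ if the login node is x86_64.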

An alternative to compiling software for ARM yourself is to use Apptainer containers built for aarch64. On Sol, the following containers are known to work on this node:

[software@sgh001:~]$ ls -1 /packages/aarch64/simg/
autodock_2020.06.sif*
chroma_2021.04.sif*
gromacs_2023.2.sif*
julia_v2.4.1.sif*
lammps_patch_15Jun2023.sif*
nvhpc_24.5-devel-cuda_multi-ubuntu22.04.sif*
pytorch_24.05-py3.sif*
quantum_espresso_qe-7.1.sif*
relion_3.1.3.sif*
tensorflow_24.05-tf2-py3-igpu.sif*
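
Because the images live in a shared directory, commands that use them need the full path unless you change into that directory first. The sketch below shows one way to resolve an image by name prefix. The helper find_image and the SIMG_DIR variable are hypothetical names introduced here for illustration; only the directory path comes from the listing above.

```shell
#!/usr/bin/env bash
# Directory from the listing above; set SIMG_DIR beforehand to point elsewhere.
SIMG_DIR="${SIMG_DIR:-/packages/aarch64/simg}"

# find_image NAME: print the full path of every image whose file name
# starts with NAME. Returns non-zero when nothing matches.
# (Hypothetical helper, not a site-provided command.)
find_image() {
    local name="$1" path found=1
    for path in "$SIMG_DIR"/"$name"*.sif; do
        # An unmatched glob stays a literal string, so check the file exists.
        if [ -e "$path" ]; then
            printf '%s\n' "$path"
            found=0
        fi
    done
    return "$found"
}

find_image pytorch || echo "no pytorch image found in $SIMG_DIR" >&2
```

The printed path can then be passed directly to apptainer in place of a bare file name.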

Requesting Grace Hopper from the Job Scheduler

Using the following commands, you can request an allocation on this node and run these containers:

$ salloc -p highmem -Lgracehopper

OR, to also get the H100 GPU:

$ salloc -p highmem -Lgracehopper -G 1

Running a container with Grace Hopper

$ apptainer run /packages/aarch64/simg/pytorch_24.05-py3.sif         # when CPU only, OR

$ apptainer run --nv /packages/aarch64/simg/pytorch_24.05-py3.sif    # when GPU is requested


=============
== PyTorch ==
=============

NVIDIA Release 24.05 (build 91431256)
PyTorch Version 2.4.0a0+07cecf4
Apptainer> python
Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
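
The interactive steps above can also be expressed as a batch job. The following is a sketch under stated assumptions rather than a site-provided template: the file name gh_pytorch_check.sh, the job name, and the 10-minute time limit are illustrative choices, while the partition, license, GPU flag, and image path are taken from the commands and listing on this page. The snippet only writes the script to the current directory so it can be reviewed before submission.

```shell
#!/usr/bin/env bash
# Write a small Slurm batch script (hypothetical file name) to the current directory.
cat > gh_pytorch_check.sh <<'EOF'
#!/bin/bash
# Partition and license flag, as in the salloc examples.
#SBATCH -p highmem
#SBATCH -L gracehopper
# Request the H100 GPU; remove this line for a CPU-only job.
#SBATCH -G 1
# Illustrative time limit and job name (assumptions, adjust as needed).
#SBATCH -t 00:10:00
#SBATCH -J gh-pytorch-check

# --nv makes the host NVIDIA driver and GPU available inside the container.
apptainer exec --nv /packages/aarch64/simg/pytorch_24.05-py3.sif \
    python -c 'import torch; print(torch.cuda.is_available())'
EOF

echo "Wrote gh_pytorch_check.sh; submit it with: sbatch gh_pytorch_check.sh"
```

If the job receives the GPU, its output file should contain the same True seen in the interactive session. For a CPU-only job, drop both the -G line and the --nv flag.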