Using Graphics Processing Units (GPUs)
As of April 2021, Agave has over 115 GPU nodes with 340+ GPUs available for accelerating research computation. These GPU nodes are grouped into two main types: public and private.
Public GPUs are available to all researchers on Agave just like the main CPU partitions serial and parallel: jobs will run until they fail, reach their time limit, or complete. These GPUs are generally accessible via the publicgpu partition.
Private nodes are purchased by researchers who have prioritized access to their hardware. Private nodes are available to the public in two modes: high-throughput computing (HTC) or fully opportunistic. In the HTC model, GPUs may be requested for up to 4 hours via the htcgpu partition. In the fully opportunistic model, GPUs may be requested for up to 1 week at the risk of preemption, i.e., the opportunistic job may be cancelled before completing to allow the hardware owner's job to begin.
All GPUs may be requested via the gpu partition.
For public users: The gpu partition is a superset of all public and private partitions, and includes the non-preemptable publicgpu and htcgpu superset partitions. Jobs submitted to gpu may opportunistically run on private (researcher-purchased) hardware and may therefore be preempted (that is, the job may be cancelled to allow the owner's job to start).
As gpu is the largest partition, jobs are likely to start soonest when submitted there. However, it is recommended that production-level jobs submitted to gpu checkpoint on the order of every 4 hours of wall time, so that compute can be manually resumed if the job is preempted; checkpointing at this cadence also allows the same jobs to use the htcgpu partition.
Jobs submitted to publicgpu or htcgpu cannot be preempted! publicgpu jobs have a maximum wall time of 1 week, while htcgpu jobs have a maximum wall time of 4 hours.
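As a minimal sketch of that checkpointing pattern, the body of a job script might resume from a checkpoint file when one exists. The file name and the --resume/--checkpoint flags below are hypothetical placeholders for whatever checkpoint mechanism your own application provides:
# Sketch only: checkpoint.dat, --resume, and --checkpoint are hypothetical
# placeholders for your application's own checkpoint mechanism.
if [ -f checkpoint.dat ]; then
    ./myresearchjob --resume checkpoint.dat        # continue from the last checkpoint
else
    ./myresearchjob --checkpoint checkpoint.dat    # fresh start, writing checkpoints
fi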
Important: Be sure you understand how to use sbatch scripts
Please make sure that you understand how to use sbatch scripts before trying to use GPU resources on Agave.
The information below assumes that you have read and understood the documentation linked above. This page goes into more detail on sbatch scripts for GPUs below the resource table.
Description of Available Resources
There are currently (April 2021) three main groups of GPU partitions on Agave. You can allow the system to choose from multiple partitions by putting the partition names, separated by commas, in quotes.
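For example, to let the scheduler choose between two partitions (the names here are illustrative picks from the table below):
#SBATCH -p "cidsegpu1,physicsgpu1"   # job starts on whichever listed partition can schedule it first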
Below is a list of all the GPU partitions and their specs:
GPU Partitions (all are accessible via the superset partition gpu)

Active Private GPU Partitions (Fully Opportunistic)

Partition Name | Node Groups | Cores per Node | Node Mem. (GiB) | -C Constraint Flag | GPUs per Node | GPU Memory (MiB) |
---|---|---|---|---|---|---|
physicsgpu1 | cg19-[1-11] | 24 | 62 | GTX1080 | 4 | 11,178 |
physicsgpu2 | gpu2-[1-6] | 24 | 92 | RTX2080 | 4 | 11,019 |
cidsegpu1 | cg20-[1-9] | 20 | 92 | V100_16, V100 | 4 | 16,160 |
cidsegpu2 | cg20-10 | 24 | 187 | V100_16, V100 | 8 | 16,160 |
cidsegpu3 | s72-[1-3] | 24 | 251 | GTX1080 | 4 | 8,120 |
cmuhichgpu1 | gpu6-[1-3] | 24 | 125 | RTX2080 | 4 | 11,019 |
gdcsgpu1 | cg28-1 | 24 | 187 | V100_32, V100 | 2 | 32,510 |
gdcsgpu1 | cg28-2 | 24 | 187 | RTX2080 | 4 | 11,019 |
gdcsgpu1 | cg28-3 | 24 | 376 | RTX2080 | 2 | 7,981 |
jlianggpu1 | dgx1-1 | 40 | 503 | V100_32, V100 | 8 | 32,480 |
lsankargpu1 | gpu7-1 | 32 | 187 | RTX2080 | 4 | 11,019 |
lsankargpu2 | gpu7-2 | 52 | 187 | A100 | 1 | 40,536 |
rgerkingpu1 | gpu4-1 | 24 | 187 | V100_16, V100 | 1 | 16,160 |
sjayasurgpu1 | gpu5-1 | 24 | 187 | RTX2080 | 4 | 11,019 |
sjayasurgpu1 | gpu5-2 | 52 | 187 | RTX6000 | 3 | 22,698 |
yyanggpu1 | cg20-11 | 36 | 92 | V100_16, V100 | 4 | 16,160 |
sulcgpu1 | cg21-1 | 20 | 62 | GTX1080 | 8 | 11,178 |
sulcgpu2 | cg21-2 | 28 | 92 | GTX1080 | 4 | 11,178 |
mrlinegpu1 | cg24-1 | 48 | 376 | V100_32, V100 | 4 | 32,480 |
asinghargpu1 | cg23-[1-19] | 8 | 92 | GTX1080 | 1 | 11,178 |
asinghargpu2 | gpu8-1 | 52 | 187 | V100_16, V100 | 2 | 16,160 |
wzhengpu1 | cg25-[1-5] | 24 | 92 | RTX2080 | 4 | 7,952 |
sulcgpu1 | gpu14-1 | 32 | 192 | A6000 | 8 | 49,140 |
mrlinegpu2 | gpu15-1 | 64 | 256 | A100 | 2 | 40,960 |
Active HTC Private GPU Partitions (accessible via the superset partition htcgpu)

Partition Name | Node Groups | Cores per Node | Node Mem. (GiB) | -C Constraint Flag | GPUs per Node | GPU Memory (MiB) |
---|---|---|---|---|---|---|
htcgpu1 | cg32-[1-4] | 40 | 92 | V100_16, V100 | 4 | 16,160 |
htcgpu2 | cg32-[5-6] | 40 | 187 | V100_32, V100 | 4 | 32,320 |
htcgpu3 | gpu9-1 | 52 | 251 | V100_16, V100 | 4 | 16,160 |
htcgpu4 | gpu10-1 | 64 | 256 | A100 | 6 | 40,536 |
htcgpu5 | gpu11-1 | 56 | 192 | A100 | 2 | 40,536 |
htcgpu6 | gpu12-1 | 128 | 512 | A100 | 4 | 81,251 |
htcgpu7 | gpu13-1 | 128 | 512 | A100 | 4 | 81,251 |
htcgpu8 | gpu13-1 | 128 | 512 | A100 | 4 | 81,251 |
Active RC GPU Partitions (all are exclusively accessible via the superset partition publicgpu)

Partition Name | Node Groups | Cores per Node | Node Mem. (GiB) | -C Constraint Flag | GPUs per Node | GPU Memory (MiB) |
---|---|---|---|---|---|---|
rcgpu1 | cg26-[1-2] | 40 | 376 | V100_32, V100 | 4 | 32,480 |
rcgpu3 | s65-[1-3] | 16 | 251 | K40 | 2 | 11,441 |
rcgpu4 | s71-1 | 16 | 62 | GTX1080, K20 | 1 | 4,744 |
rcgpu6 | gpu1-[1-2] | 40 | 187 | V100_16, V100 | 4 | 16,160 |
rcgpu7 | gpu3-[2-30] | 24 | 125 | K80 | 2 | 11,441 |
Sending Jobs to a GPU Partition
To send jobs to one of these partitions, a few special parameters need to be passed to Slurm.
For example, if someone wanted to use two GPUs and one CPU on a node in the gpu superset partition, the header of their sbatch script would look something like this:
#!/bin/bash
#SBATCH -N 1 # number of compute nodes
#SBATCH -n 1 # number of tasks your job will spawn
#SBATCH --mem=4G # amount of RAM requested in GiB (2^30 bytes)
#SBATCH -p gpu # Use the gpu superset partition
#SBATCH -q wildfire # Run job under wildfire QOS queue
#SBATCH --gres=gpu:2 # Request two GPUs
#SBATCH -t 0-12:00 # wall time (D-HH:MM)
#SBATCH -o slurm.%j.out # STDOUT (%j = JobId)
#SBATCH -e slurm.%j.err # STDERR (%j = JobId)
#SBATCH --mail-type=ALL # Send a notification when a job starts, stops, or fails
#SBATCH --mail-user=myemail@asu.edu # send-to address
...
nvidia-smi # Useful for seeing GPU status and activity
./myresearchjob
...
Most of the options used in the script above are explained in our sbatch documentation.
Here is an explanation of the new options needed for GPU jobs:
#SBATCH -p gpu
This option defines the partition the job should be run on. This parameter is required.
If you are affiliated with a group that owns a GPU partition, you may replace this with that partition's name, e.g., cidsegpu1, physicsgpu1, etc.
#SBATCH -q wildfire
This option defines the QOS queue that the job will be run under.
With the exception of the researchers who purchased these nodes, all users are required to use the wildfire QOS. On private, fully-opportunistic partitions, only the owners may run non-preemptable jobs. This means that your jobs submitted to private, fully-opportunistic hardware may be preempted (cancelled) to allow jobs from the hardware owners to run.
Jobs submitted to publicgpu are non-preemptable. For jobs that are 4 hours or less, the htcgpu partition may be accessed with the normal QOS; these jobs are also non-preemptable.
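For example, a job of 4 hours or less could avoid preemption entirely with a header fragment like this:
#SBATCH -p htcgpu # HTC superset partition: non-preemptable
#SBATCH -q normal # normal QOS is permitted for htcgpu jobs
#SBATCH -t 0-04:00 # wall time (D-HH:MM); htcgpu maximum is 4 hours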
#SBATCH --gres=gpu:2
This tells the system that you want to request two GPUs.
This option is required. You can request a single GPU by putting a 1 at the end of this line instead of a 2.
The number of GPUs you can request depends on the partition used, e.g.:
For the physicsgpu1 partition, the limit is 4
For the cidsegpu1 partition, the limit is 4
For the sulcgpu1 partition, the limit is 8
et cetera
The GPU type may be specified with additional notation: for instance, --gres=gpu:GTX1080:1 requests a GTX1080. This may also be done with the constraint flag, detailed below.
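For example, to request two V100s by type and confirm which devices were granted (Slurm typically exports the assigned GPU indices in CUDA_VISIBLE_DEVICES):
#SBATCH --gres=gpu:V100:2 # request two V100 GPUs by type
...
echo $CUDA_VISIBLE_DEVICES # GPU indices Slurm assigned to this job
nvidia-smi # confirm the visible devices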
Advanced options:
#SBATCH -C V100
This is an optional flag that specifies constraints on your submitted job. Such a constraint might ensure that you land on a GPU with 32 GB of memory instead of the more common 16 GB.
The available constraints are: K20, K40, K80, GTX1080, RTX2080, RTX6000, A6000, V100, V100_16, V100_32, or A100.
It is always recommended that users of the gpu partition specify their preferred type of GPU via this -C flag rather than using the corresponding partition.
CUDA Versions Available on Agave
CUDA is a parallel computing platform and programming model developed by NVIDIA for general-purpose computing on graphics processing units (GPUs).
To check the CUDA versions available on the supercomputer, use the following command:
module avail cuda
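After choosing a version from that list, load it and verify the compiler. The version string below is only an example; substitute one of the module names that module avail cuda actually reports:
module load cuda/11.2 # hypothetical module name; use one listed by 'module avail cuda'
nvcc --version # confirm the CUDA compiler is now on your PATH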