Using Graphics Processing Units (GPUs)

As of April 2021, Agave has over 115 GPU nodes with 340+ GPUs available for accelerating research computation. These GPU nodes are grouped into two main types: public and private.

Public GPUs are available to all researchers on Agave, just like the main CPU partitions serial and parallel: jobs run until they complete, fail, or reach their time limit. These GPUs are generally accessible via the publicgpu partition.

Private nodes are purchased by researchers, who receive prioritized access to their hardware. Private nodes are available to the public in two modes: high-throughput computing (HTC) and fully opportunistic. In the HTC model, GPUs may be requested for up to 4 hours via the htcgpu partition. In the fully opportunistic model, GPUs may be requested for up to 1 week at the risk of preemption, i.e., the opportunistic job may be cancelled before completing in order to allow the hardware owner's job to begin.

All GPUs may be requested via the gpu partition.

For public users: the gpu partition is a superset of all public and private GPU partitions, and includes the non-opportunistic publicgpu and htcgpu supersets. Jobs submitted to gpu may opportunistically run on private (researcher-purchased) hardware and are therefore subject to preemption (that is, the job may be cancelled to allow the owner's job to start).

As gpu is the largest partition, jobs are likely to start soonest when submitted there. However, it is recommended that production-level jobs submitted to gpu checkpoint roughly every 4 hours of wall time, so that compute can be resumed manually after a preemption; jobs that checkpoint at this interval can also make use of the htcgpu partition. A sketch of one checkpointing approach follows.
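
For illustration, here is a minimal sketch of one such approach: Slurm can deliver a warning signal shortly before a job is stopped, and the batch script can forward it to the application. The application name my_app, its flags, and its checkpoint behavior are hypothetical stand-ins for your own checkpoint-capable code, and how much warning a preempted job receives depends on the cluster's Slurm configuration.

```bash
#!/bin/bash
#SBATCH -p gpu
#SBATCH -q wildfire
#SBATCH --gres=gpu:1
#SBATCH -t 0-12:00
#SBATCH --signal=B:USR1@300   # deliver SIGUSR1 to this batch script ~300 s before the time limit

# Forward the warning signal to the application so it can write a checkpoint.
trap 'kill -USR1 "$APP_PID"' USR1 TERM

# my_app and its flags are hypothetical; it is assumed to checkpoint on SIGUSR1
# and to resume from its latest checkpoint file when restarted.
./my_app --checkpoint-interval 4h --resume latest &
APP_PID=$!
wait "$APP_PID"   # returns early if a signal arrives; a production script would wait again
```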

Jobs submitted to publicgpu or htcgpu cannot be preempted! publicgpu jobs have a maximum wall time of 1 week, while htcgpu jobs have a maximum wall time of 4 hours.

Important: Be sure you understand how to use sbatch scripts

Please make sure that you understand how to use sbatch scripts on Agave before trying to use GPU resources.

The information below assumes that you have read and understood the sbatch documentation linked above. This page goes into more detail on sbatch scripts for GPUs below the resource tables.


Description of Available Resources

There are currently (April 2021) three main groupings of GPU partitions on Agave: fully opportunistic private partitions, HTC private partitions, and public (RC) partitions. You can allow the system to choose from multiple partitions by listing the partition names separated by commas, as in the following example.
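
For instance, the following directive (using partition names drawn from the tables below as examples) lets Slurm start the job on whichever listed partition has free resources first:

```bash
#SBATCH -p "cidsegpu1,physicsgpu1"   # Slurm may place the job on any listed partition
```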

Below is a list of all the GPU partitions and their specs:

GPU Partitions – (All are accessible with partition superset gpu with wildfire QOS)

Active Private GPU Partitions – Fully Opportunistic – wildfire QOS – Max Wall Time: 1 week

| Partition Name | Node Groups | Cores per Node | Node Mem. (GiB) | -C Constraint Flag | GPUs per Node | GPU Memory (MiB) |
|----------------|-------------|----------------|-----------------|--------------------|---------------|------------------|
| physicsgpu1    | cg19-[1-11] | 24  | 62  | GTX1080       | 4 | 11,178 |
| physicsgpu2    | gpu2-[1-6]  | 24  | 92  | RTX2080       | 4 | 11,019 |
| cidsegpu1      | cg20-[1-9]  | 20  | 92  | V100_16, V100 | 4 | 16,160 |
| cidsegpu2      | cg20-10     | 24  | 187 | V100_16, V100 | 8 | 16,160 |
| cidsegpu3      | s72-[1-3]   | 24  | 251 | GTX1080       | 4 | 8,120  |
| cmuhichgpu1    | gpu6-[1-3]  | 24  | 125 | RTX2080       | 4 | 11,019 |
| gdcsgpu1       | cg28-1      | 24  | 187 | V100_32, V100 | 2 | 32,510 |
| gdcsgpu1       | cg28-2      | 24  | 187 | RTX2080       | 4 | 11,019 |
| gdcsgpu1       | cg28-3      | 24  | 376 | RTX2080       | 2 | 7,981  |
| jlianggpu1     | dgx1-1      | 40  | 503 | V100_32, V100 | 8 | 32,480 |
| lsankargpu1    | gpu7-1      | 32  | 187 | RTX2080       | 4 | 11,019 |
| lsankargpu2    | gpu7-2      | 52  | 187 | A100          | 1 | 40,536 |
| rgerkingpu1    | gpu4-1      | 24  | 187 | V100_16, V100 | 1 | 16,160 |
| sjayasurgpu1   | gpu5-1      | 24  | 187 | RTX2080       | 4 | 11,019 |
| sjayasurgpu1   | gpu5-2      | 52  | 187 | RTX6000       | 3 | 22,698 |
| yyanggpu1      | cg20-11     | 36  | 92  | V100_16, V100 | 4 | 16,160 |
| sulcgpu1       | cg21-1      | 20  | 62  | GTX1080       | 8 | 11,178 |
| sulcgpu2       | cg21-2      | 28  | 92  | GTX1080       | 4 | 11,178 |
| mrlinegpu1     | cg24-1      | 48  | 376 | V100_32, V100 | 4 | 32,480 |
| asinghargpu1   | cg23-[1-19] | 8   | 92  | GTX1080       | 1 | 11,178 |
| asinghargpu2   | gpu8-1      | 52  | 187 | V100_16, V100 | 2 | 16,160 |
| wzhengpu1      | cg25-[1-5]  | 24  | 92  | RTX2080       | 4 | 7,952  |
| sulcgpu1       | gpu14-1     | 32  | 192 | A6000         | 8 | 49,140 |
| mrlinegpu2     | gpu15-1     | 64  | 256 | A100          | 2 | 40,960 |

Active HTC Private GPU Partitions – htcgpu – No preemption for jobs with a wall time of 4 hours or less, submitted with the normal QOS

| Partition Name | Node Groups | Cores per Node | Node Mem. (GiB) | -C Constraint Flag | GPUs per Node | GPU Memory (MiB) |
|----------------|-------------|----------------|-----------------|--------------------|---------------|------------------|
| htcgpu1 | cg32-[1-4] | 40  | 92  | V100_16, V100 | 4 | 16,160 |
| htcgpu2 | cg32-[5-6] | 40  | 187 | V100_32, V100 | 4 | 32,320 |
| htcgpu3 | gpu9-1     | 52  | 251 | V100_16, V100 | 4 | 16,160 |
| htcgpu4 | gpu10-1    | 64  | 256 | A100          | 6 | 40,536 |
| htcgpu5 | gpu11-1    | 56  | 192 | A100          | 2 | 40,536 |
| htcgpu6 | gpu12-1    | 128 | 512 | A100          | 4 | 81,251 |
| htcgpu7 | gpu13-1    | 128 | 512 | A100          | 4 | 81,251 |
| htcgpu8 | gpu13-1    | 128 | 512 | A100          | 4 | 81,251 |

Active RC GPU Partitions (All are exclusively accessible via the superset partition publicgpu with wildfire QOS) – Max Wall Time: 1 week

| Partition Name | Node Groups | Cores per Node | Node Mem. (GiB) | -C Constraint Flag | GPUs per Node | GPU Memory (MiB) |
|----------------|-------------|----------------|-----------------|--------------------|---------------|------------------|
| rcgpu1 | cg26-[1-2]  | 40 | 376 | V100_32, V100 | 4 | 32,480 |
| rcgpu3 | s65-[1-3]   | 16 | 251 | K40           | 2 | 11,441 |
| rcgpu4 | s71-1       | 16 | 62  | GTX1080; K20  | 1 | 4,744  |
| rcgpu6 | gpu1-[1-2]  | 40 | 187 | V100_16, V100 | 4 | 16,160 |
| rcgpu7 | gpu3-[2-30] | 24 | 125 | K80           | 2 | 11,441 |


Sending Jobs to a GPU Partition

To send jobs to one of these partitions, a few special parameters need to be passed to Slurm.

For example, if someone wanted to use two GPUs and one CPU core on a node in the gpu superset partition, the header of their sbatch script would look something like this:

```bash
#!/bin/bash

#SBATCH -N 1                          # number of compute nodes
#SBATCH -n 1                          # number of tasks your job will spawn
#SBATCH --mem=4G                      # amount of RAM requested in GiB (2^30 bytes)
#SBATCH -p gpu                        # use the gpu superset partition
#SBATCH -q wildfire                   # run job under wildfire QOS queue
#SBATCH --gres=gpu:2                  # request two GPUs
#SBATCH -t 0-12:00                    # wall time (D-HH:MM)
#SBATCH -o slurm.%j.out               # STDOUT (%j = JobId)
#SBATCH -e slurm.%j.err               # STDERR (%j = JobId)
#SBATCH --mail-type=ALL               # send a notification when a job starts, stops, or fails
#SBATCH --mail-user=myemail@asu.edu   # send-to address

...
nvidia-smi                            # useful for seeing GPU status and activity
./myresearchjob
...
```
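
The script can then be submitted and monitored with the standard Slurm commands; the file name gpu_job.sh and the JobId below are placeholders:

```bash
sbatch gpu_job.sh    # submit; Slurm prints the assigned JobId
squeue -u $USER      # show the queue status of your jobs
scancel 123456       # cancel a job by its JobId if needed
```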

Most of the options used in the script above are explained in our sbatch documentation.

Here is an explanation of the new options needed for GPU jobs:

  • #SBATCH -p gpu

    • This option defines the partition the job should be run on.

      This parameter is required. If you are affiliated with a group that owns a private GPU partition, you may replace this with that partition's name, e.g., cidsegpu1, physicsgpu1, etc.

  • #SBATCH -q wildfire

    • This option defines the QOS queue that the job will be run under.

    • With the exception of the researchers who purchased these nodes, all users are required to use the wildfire QOS queue. On private, fully opportunistic partitions, only the owners may run non-preemptable jobs.

    • What this means is that your jobs (submitted to private, fully-opportunistic hardware) may be preempted (cancelled) to allow jobs from hardware owners to run.

    • Jobs submitted to publicgpu are non-preemptable.

    • For jobs that are 4 hours or less, the htcgpu partition may be accessed with the normal QOS. These jobs are non-preemptable; see the sketch after this list.

  • #SBATCH --gres=gpu:2

    • This tells the system that you want to request two GPUs

    • This option is required.

    • You can request a single GPU by putting a 1 at the end of this line instead of 2.

    • The number of GPUs you can request depends on the partition used e.g.:

      • For the physicsgpu1 partition, the limit is 4

      • For the cidsegpu1 partition, the limit is 4

      • For the sulcgpu1 partition, the limit is 8

      • et cetera

    • The GPU type may be specified with additional notation; for instance, --gres=gpu:GTX1080:1 requests one GTX1080. This may also be done with the constraint flag, detailed below.
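
As a sketch of the htcgpu route mentioned above, the header of a short, non-preemptable job might look like the following:

```bash
#SBATCH -p htcgpu      # private HTC GPU partitions, non-preemptable
#SBATCH -q normal      # normal QOS is allowed for jobs of 4 hours or less
#SBATCH -t 0-04:00     # wall time must not exceed 4 hours
#SBATCH --gres=gpu:1   # request one GPU
```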

Advanced options:

  • #SBATCH -C V100

    • This is an optional flag that places constraints on where your submitted job may run. Such a constraint might ensure, for example, that your job lands on a GPU with 32 GiB of memory instead of the more common 16 GiB.

    • Other available constraints (see the tables above) are: K20, K40, K80, GTX1080, RTX2080, RTX6000, A6000, V100, V100_16, V100_32, or A100.

    • It is recommended that users of the gpu partition specify their preferred GPU type via this -C flag rather than by submitting to the corresponding private partition; see the example below.
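
For instance, a header combining the gpu superset partition with a constraint might look like the following sketch, which requests one 32 GiB V100 wherever it is available:

```bash
#SBATCH -p gpu           # largest superset partition
#SBATCH -q wildfire      # required QOS for the gpu superset
#SBATCH --gres=gpu:1     # request one GPU
#SBATCH -C V100_32       # only nodes whose GPUs carry the V100_32 feature
```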


CUDA Versions Available on Agave

CUDA is a parallel computing platform and programming model developed by NVIDIA for general-purpose computing on graphics processing units (GPUs).

To check which CUDA versions are available on the supercomputer, use the following command:

```bash
module avail cuda
```
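
A typical follow-up is to load one of the listed modules and verify the compiler; the module name below is hypothetical, so substitute one that module avail actually shows:

```bash
module load cuda/11.2   # hypothetical name; pick one from the module avail output
nvcc --version          # confirm the CUDA compiler is now on your PATH
```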