Important Terms
Node: A single machine in a supercomputer. A node may be either a physical machine or a virtual machine.
Scheduler: The application on our end that assigns compute resources for jobs.
Slurm: The brand name of our scheduler which manages and allocates resources.
MPI: Message Passing Interface (MPI) is a standardized and portable message-passing interface used on parallel computing architectures.
SBATCH: A means of submitting a batch job, whose tasks will be executed on the allocated resources.
What is a Job?
A job is any work submitted to the supercomputer that requests or uses its resources. There are three types of jobs that a user can request:
An interactive graphical application, such as Jupyter, RStudio, or MATLAB. This is an ideal option for new users as they become familiar with the supercomputer. This is also ideal for code that is being developed or tested before it is submitted as a batch job.
An interactive shell session. This allows the user to continue interacting with the node through the shell while the job is running. This is also ideal for code that is being developed or for applications that do not have a graphical interface.
A batch job. This is a job that is submitted as a complete script and runs unattended. This is done through an SBATCH script.
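As a minimal sketch of the batch-job style described above (the job name and the command at the end are hypothetical placeholders, not requirements of the Aloe environment):

```shell
#!/bin/bash
#SBATCH -J hello_aloe     # job name (hypothetical)
#SBATCH -c 1              # one CPU per task
#SBATCH -N 1              # a single node
#SBATCH -t 0-00:05:00     # five minutes of wall time

# everything below runs unattended once resources are allocated
echo "Job started on $(hostname)"
```

Such a script would be submitted from a login shell with sbatch, e.g. sbatch myscript.sh.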
Understanding Resources
There are five main resources for all jobs:
Partition: This is a group of nodes that share a similar feature or characteristic.
QOS: This is the quality of service (QOS) required by a partition.
CPUs: Short for “Central Processing Unit”, also called a core. This is the core component that defines a computing device, such as a node.
Session Wall Time: The actual time taken from the start of a computing program to the end. This is the amount of time that users anticipate their jobs needing to complete.
GPUs: Short for “Graphic Processing Unit”. This is a specialized piece of hardware that can enable and accelerate certain computational research.
Selecting a Partition
There is a single partition (general) and a single QOS (normal) in the Aloe environment. To improve ease of use, these values are applied by default and may be left out of job scripts for simplicity.
Requesting CPUs
When requesting resources, it is best for performance to keep all the resources close together. This is best accomplished by requesting that all of the cores you request stay on the same node with -N 1.
To request a given number of CPUs sharing the same node, you can use the following in your SBATCH script:
#SBATCH -c 5   # CPUs per TASK
#SBATCH -N 1   # keep all tasks on the same node
To request a given number of CPUs spread across multiple nodes, you can use the following:
#SBATCH -c 5   # CPUs per TASK
#SBATCH -n 10  # number of TASKS
#SBATCH -N 10  # allow tasks to spread across multiple nodes (MIN & MAX)
The above example will allocate 50 cores, 5 cores or “workers” per task, on 10 independent nodes.
Take note of the inclusion or omission of -N:
#SBATCH -c 5   # CPUs per TASK
#SBATCH -n 10  # number of TASKS
This reduced example will still allocate 50 cores with 5 cores per task, but on any number of available nodes. Note that since there is no MPI capability in the Aloe environment, you will likely always add -N 1. This ensures that each job’s workers have the lowest latency to one another.
As a general rule, CPU-only nodes have 128 cores and GPU-present nodes have 48 cores.
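As a quick sanity check on any of the requests above, a running job can print the environment variables that Slurm exports to confirm what was actually allocated (the values shown will depend on your request):

```shell
# inside a running job, Slurm exports these variables automatically
echo "Tasks allocated: ${SLURM_NTASKS:-unset}"
echo "CPUs per task:   ${SLURM_CPUS_PER_TASK:-unset}"
echo "Node list:       ${SLURM_JOB_NODELIST:-unset}"
```

Placed at the top of a batch script, these lines land in the job's output file, making it easy to compare the allocation against what was requested.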
Requesting Memory
On Aloe, cores and memory are de-coupled. If you need only a single CPU but require ample memory, you can do so like this:
#SBATCH -c 1         # CPUs per TASK
#SBATCH -N 1         # keep all tasks on the same node
#SBATCH --mem=120G   # request 120 GB of memory
Every node has at least 500 GB of memory.
The largest amount of memory available on a single node is 2 TB.
Requesting GPUs
Aloe offers 20 Nvidia A100 GPUs with 2 GPUs available per GPU-present node.
These GPUs are publicly available and usable up to the maximum wall time of 28 days.
Requesting GPUs for an interactive job can be done with:
salloc -G a100:1
# or
salloc -G a100:2
SBATCH scripts can request GPUs with the format -G [type]:[quantity]:
#SBATCH -N 1
...
#SBATCH -G a100:1
Minimum Memory Allocation
To best maximize GPU throughput, requested system memory (RAM) should be set to match at least the aggregate amount of GPU RAM allocated. To help ensure this efficiency, all GPUs are accompanied by a minimum of 40 GB of system RAM per GPU.
For example, if you request 2 A100s, your job can have no less than 80 GB of system RAM. Naturally, you may choose to ask for more than 80 GB, e.g., 500 GB.
#SBATCH --gres=gpu:a100:2
#SBATCH --mem=50G   # Even though you requested 50GB, it will be MAX(50,80) = 80GB
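The MAX(requested, floor) behavior can be sketched with a little shell arithmetic (the variable names here are illustrative, not Slurm options):

```shell
gpus=2
requested_gb=50
floor_gb=$(( gpus * 40 ))        # 40 GB of system RAM per GPU
effective_gb=$(( requested_gb > floor_gb ? requested_gb : floor_gb ))
echo "$effective_gb"             # prints 80
```

Requesting 500 GB instead would leave the larger value in effect, since 500 exceeds the 80 GB floor.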
Techniques to Determine Resource Usage
At the end of each job, you can execute the seff command to get a Slurm EFFiciency report. This report will tell you how much of your requested cores and memory were utilized:
Job ID: 12345
Cluster: cluster
User/Group: wdizon/wdizon
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 23
CPU Utilized: 4-13:43:50
CPU Efficiency: 2.84% of 161-01:57:18 core-walltime
Job Wall-clock time: 7-00:05:06
Memory Utilized: 133.85 GB
Memory Efficiency: 66.93% of 200.00 GB
Understanding seff Efficiency Numbers
CPU Utilized: 4-13:43:50
CPU Efficiency: 2.84% of 161-01:57:18 core-walltime
The above job ran for 4 days and ~14 hours of the 7 days requested.
Efficiency is not about a job's runtime (4 days vs. the 7 days requested), but the amount of CPU time utilized compared to the amount allocated (unused time after execution completes is not counted). This job is clearly memory-bound, with an over-allocation of CPUs.
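The 2.84% figure can be reproduced from the report's own numbers; as a quick check (using awk for the floating-point division):

```shell
used=$(( 4*86400 + 13*3600 + 43*60 + 50 ))   # CPU Utilized:   4-13:43:50, in seconds
wall=$(( 7*86400 +  0*3600 +  5*60 +  6 ))   # Wall-clock time: 7-00:05:06, in seconds
cores=23                                     # cores allocated per the report
awk -v u="$used" -v c="$cores" -v w="$wall" \
    'BEGIN { printf "%.2f%%\n", 100 * u / (c * w) }'   # prints 2.84%
```

The denominator (cores × wall-clock) is the 161-01:57:18 of core-walltime shown in the report.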
Such a job might instead be modified from the following sbatch:
#SBATCH -c 23
#SBATCH -N 1
#SBATCH --mem=200G
#SBATCH -t 7-00:00:00
Modified to:
#SBATCH -c 1         # roughly only one core was used
#SBATCH -N 1
#SBATCH --mem=150G   # 67% of 200GB, rounding up a bit
While it is also possible to reduce -t (e.g., to -t 5-00:00:00), that is not required for efficiency calculations.