Important Terms

  • Node: A single machine in a supercomputer. This will be either a physical machine or a virtual machine. 

  • Scheduler: The application on our end that assigns compute resources for jobs.

  • Slurm: The brand name of our scheduler which manages and allocates resources.

  • MPI: Message Passing Interface (MPI) is a standardized and portable message-passing standard used on parallel computing architectures.

What is a Job?

A job is any work submitted to the supercomputer that requests or uses its resources. There are three types of jobs that a user can request:

  1. An interactive graphical application, such as Jupyter, RStudio, or MATLAB. This is an ideal option for new users as they become familiar with the supercomputer. This is also ideal for code that is being developed or tested before it is submitted as a batch job.

  2. An interactive shell session. This allows the user to continue interacting with the node through the shell while the job is running. This is also ideal for code that is being developed or for applications that do not have a graphical interface.

  3. A batch job. This is a job that is submitted as a complete script and runs unattended. This is done through an SBATCH script.

Understanding Resources

There are five main resources for all jobs:

  1. Partition: This is a group of nodes that share a similar feature or characteristic.

  2. QOS: This is the quality of service (QOS) required by a partition.

  3. CPUs: Short for “Central Processing Unit”, also called a core. This is the core component that defines a computing device, such as a node.

  4. Session Wall Time: The actual time taken from the start of a computing program to the end. This is the amount of time that users anticipate their jobs needing to complete.

  5. GPUs: Short for “Graphics Processing Unit”. This is a specialized piece of hardware that can enable and accelerate certain computational research.

Selecting a Partition

There is a single partition (general) and a single QOS (normal) in the Aloe environment. To improve ease of use, these values are used by default and may be left out of job scripts for simplicity.
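
If you prefer to state the defaults explicitly, a job script could include lines like the following. This is only a sketch, assuming the standard Slurm short options -p (partition) and -q (QOS); the values are the ones named above.

```shell
#SBATCH -p general   # partition (already the default on Aloe)
#SBATCH -q normal    # QOS (already the default on Aloe)
```

Omitting both lines should give the same result, since these are the default values.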

Requesting CPUs

...

To request a given number of CPUs sharing the same node, you can use the following in your SBATCH script:

Code Block
#SBATCH -c 5     # CPUs per TASK (in this example, you will get 5 cores)
#SBATCH -N 1     # keep all tasks on the same node

To request a given number of CPUs spread across multiple nodes, you can use the following:

...

The above example will allocate 50 cores, 5 cores or “workers” per task, on 10 independent nodes.
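
The total follows from multiplying the number of tasks by the CPUs per task. A minimal shell sketch of that arithmetic, assuming 10 tasks (inferred from 50 cores at 5 cores per task); the helper name total_cores is hypothetical and is not part of Slurm:

```shell
# Hypothetical helper: total cores = number of tasks x CPUs per task.
total_cores() {
    echo $(( $1 * $2 ))
}

total_cores 10 5   # 10 tasks with 5 cores each
```

With these assumed values the helper prints 50, matching the allocation described above.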

Take note of the inclusion or omission of -N:

...

This reduced example will still allocate 50 cores, with 5 cores per task, on any number of available nodes. Note that since there is no MPI capability in the Aloe environment, you will likely prefer to always add -N 1. This ensures that the job’s workers have the lowest possible latency to each other.

Info

As a general rule, CPU-only nodes have 128 cores and GPU-present nodes have 48 cores.

...

On Aloe, cores and memory are de-coupled. If you need only a single CPU core but require ample memory, you can do so like this:

Code Block
#SBATCH -c 1     # CPUs per task (a single core)
#SBATCH -N 1     # keep all tasks on the same node
#SBATCH --mem=120G     # request 120 GB of memory
Info

Every single node has at least 500 GB of memory.
The largest amount of memory available on a single node is 2 TB.

Requesting GPUs

Info

Aloe offers 20 Nvidia A100 GPUs, with 2 GPUs available per GPU-present node.
These GPUs are publicly available and usable up to the maximum 28-day wall time.

Requesting GPUs for an interactive job can be done as follows:

Code Block
salloc -G a100:1
  -- or --
salloc -G a100:2

SBATCH scripts can request GPUs with this format: -G [type]:[quantity]

Code Block
#SBATCH -N 1
...
#SBATCH -G a100:1

...

For example, if you request 2 A100s, your job can have no less than 80 GB of system RAM. Naturally, you may choose to ask for more than 80 GB, e.g., 500 GB.
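
As an illustration of the example above, an SBATCH script asking for two A100s and more than the minimum memory might contain the following lines. This is a sketch that only combines options already shown on this page (-N, -G, --mem); the 500G value is taken from the example figure.

```shell
#SBATCH -N 1
#SBATCH -G a100:2     # two A100 GPUs
#SBATCH --mem=500G    # more than the 80 GB minimum for two A100s
```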

...

Code Block
Job ID: 12345
Cluster: cluster
User/Group: wdizon/wdizon
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 23
CPU Utilized: 4-13:43:50
CPU Efficiency: 2.84% of 161-01:57:18 core-walltime
Job Wall-clock time: 7-00:05:06
Memory Utilized: 133.85 GB
Memory Efficiency: 66.93% of 200.00 GB

Understanding seff

...

Efficiency Numbers

Code Block
CPU Utilized: 4-13:43:50
CPU Efficiency: 2.84% of 161-01:57:18 core-walltime

The above job used about 4 days and 14 hours of CPU time, summed across its cores, over a wall-clock time of about 7 days.

Efficiency is not about how long a job runs (4 days vs. 7 days), but about the amount of CPU utilized compared to the amount allocated (this does not include unused time after the completion of execution). This job is clearly memory-bound, with an over-allocation of CPUs.
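
The efficiency figure can be reproduced by hand from the seff output. The sketch below assumes that CPU Efficiency is CPU Utilized divided by (cores × wall-clock time), which is consistent with the numbers in the sample report. The helper names to_seconds and cpu_efficiency are hypothetical and are not part of seff or Slurm.

```shell
# Convert a Slurm-style duration (D-HH:MM:SS or HH:MM:SS) to seconds.
to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4) print $1*86400 + $2*3600 + $3*60 + $4;
        else         print $1*3600 + $2*60 + $3;
    }'
}

# CPU efficiency (%) = CPU time used / (cores x wall-clock time).
# Arguments: CPU Utilized, Job Wall-clock time, cores.
cpu_efficiency() {
    cpu=$(to_seconds "$1")
    wall=$(to_seconds "$2")
    awk -v c="$cpu" -v w="$wall" -v n="$3" \
        'BEGIN { printf "%.2f\n", 100 * c / (w * n) }'
}

# Values taken from the sample seff report above.
cpu_efficiency 4-13:43:50 7-00:05:06 23
```

For the sample job this prints 2.84, the same percentage seff reports. Here 23 cores × 7-00:05:06 of wall-clock time gives the 161-01:57:18 core-walltime shown in the report.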

...