Understanding the Status Page and Private Nodes

The Sol status page shows the current capacity of the Sol supercomputer and the CPUs and GPUs that are available to users. This page explains what each of the colors represents and how to request resources when nodes appear underutilized.


Legend Description

  • RED: All available CPU cores of this node are used, and all GPUs are in-use. This node is effectively 100% in-use.

  • ORANGE: Some CPU cores are allocated, and some are available.


  • GREEN: All CPU cores are available and, if applicable, all GPUs are available.


  • BLUE: Host is unavailable for new jobs.

     
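The node states behind these colors can also be checked from a login node with standard Slurm tooling; for example (the partition name is only an illustration):
$ sinfo -p htc -o "%P %t %D" – Summarize how many nodes in a partition are allocated, mixed, idle, or down, roughly matching the red, orange, green, and blue states above.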

How to Request Resources that are Seemingly Unused

Not all compute nodes are created equal: some nodes are owned by Research Computing, and some are privately-owned by individuals, labs, groups, or other entities. Below is a list of nodes whose availability has special considerations:

  • c001, c002: “lightwork” partition, aimed at “light” workloads, up to 24 hours (oversubscribed).

    A “light” workload is any workload with a significant amount of idle time between commands. Popular examples include running a VScode server, moving files, (un)tarring archives, etc. These tasks generally leave large stretches of CPU time idle and unused; oversubscribing allows multiple users to share cores and puts the otherwise-idle CPUs to good use.

    Some ways you might use this partition:
    $ vscode – From a login node, this will allocate a job on the lightwork partition and start a VScode server on a compute node.
    $ interactive -p lightwork -t 1-0 – From a login node, this will allocate a 24-hour interactive session on a lightwork compute node.
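
    A light workload can also be submitted as a batch job. The sketch below assumes batch submission to the partition works the same way as the interactive examples above; the archive name is a placeholder:
    #!/bin/bash
    #SBATCH -p lightwork      # oversubscribed partition for "light" workloads
    #SBATCH -t 0-4            # four hours; anything up to the 24-hour limit works
    #SBATCH -c 1              # one core is plenty for mostly-idle tasks

    # Unpack a hypothetical archive; largely I/O-bound, leaving CPU time idle.
    tar -xzf my_dataset.tar.gz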

  • c003, c004: “htc” nodes, aimed at normal compute workloads, up to 4 hours


    Though almost all nodes on the supercomputer are eligible to run htc jobs, these two nodes exclusively run htc jobs. The purpose of these nodes is to provide additional capacity for short-running jobs. Often, the supercomputer will be completely saturated and requests for resources may take additional time. If you need to perform interactive work, or have standard compute work to run that can finish within four hours, this is a great set of nodes to work with.


    Some ways you might use this partition:
    $ interactive -p htc -t 0-4 – From a login node, this will allocate an interactive session with a time limit of four hours.
    #SBATCH -p htc -t 0-4 – In an SBATCH script, these directives submit the job to the htc partition (these two nodes and any other eligible nodes) with a four-hour limit.
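
    A complete minimal batch script might look like the sketch below; the program name is a placeholder for your own short-running work:
    #!/bin/bash
    #SBATCH -p htc            # short-job partition (c003, c004, plus other eligible nodes)
    #SBATCH -t 0-4            # htc jobs are limited to four hours
    #SBATCH -c 4              # example core count; size this to your workload

    # Placeholder command; replace with the real work to be done.
    python my_analysis.py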

  • cg001 - cg005, ch001 - ch005, g230 - g235: “private” nodes, privately-owned CPU and GPU nodes with risk of preemption.

    These nodes are privately-owned by individuals, labs, groups, or other non-Research Computing entities. They are freely available for use with certain caveats: 1) they can be used for htc jobs without any risk of preemption (job cancellation) using the public QOS, or 2) they can be used for runtimes longer than htc’s 4 hours with the private QOS, but with risk of preemption, only from jobs submitted by the owning group.

    Take note: ch-prefixed nodes are high-memory nodes and can therefore only be used with the highmem partition; cg-prefixed nodes have GPU accelerators.

    Some ways you might use this partition:
    $ interactive -p htc -t 0-4 – From a login node, you may be allocated to these nodes, or to other nodes, all of which handle incoming htc jobs identically.
    $ interactive -p general -q private – From a login node, you will be allocated the first-available core on privately-owned hardware, for any duration (preemptable).
    $ interactive -p highmem -q private -w ch001 – From a login node, request a high-memory node, specifically the ch001 node.
    $ interactive -p general -q private -w cg001 -G 1 – From a login node, request one GPU on a privately-owned GPU node, specifically the cg001 node.
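
    The private QOS can also be used from a batch script. The sketch below assumes your job tolerates preemption; whether a preempted job is cancelled or requeued depends on the cluster's preemption policy, so --requeue is only a request:
    #!/bin/bash
    #SBATCH -p general        # general partition
    #SBATCH -q private        # private QOS: runtimes beyond 4 hours, but preemptable
    #SBATCH -t 2-0            # example two-day limit
    #SBATCH --requeue         # ask Slurm to requeue this job if it is preempted

    # Placeholder command; replace with your own long-running workload.
    bash run_long_job.sh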

  • g048, g049: “gpu slices”, nodes with 28 NVIDIA A100 GPU slices, each with 10GB of memory.

    Each circle on the status page represents 7 slices of a single A100 GPU. Computationally, one slice offers the performance and capabilities of an A100, but with only 10GB of memory rather than 40GB or 80GB. If 10GB of memory is sufficient for your job, these nodes promise comparable performance with far greater scheduling availability.

    Notably, if each GPU process requires 10GB or less of memory, a job can use more than 4 GPUs: up to 7 GPUs (7 slices of a complete A100) or even 28 GPUs (7 slices x 4 A100s). Though the unsliced A100 performs best in raw capability, jobs that benefit from the broadened parallelism of 28 simultaneous GPUs may perform better in certain cases.

    Some ways you might use this partition:
    $ interactive -p general -q public --gres=gpu:1g.10gb:7 – From a login node, request seven GPU slices.

    An A100 slice has the same CUDA capabilities as a complete A100. This means it is possible to start and test jobs on a slice to confirm your pipeline is operational, e.g., that your Python environment has CUDA available, can identify the GPU, etc.
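
    For example, from inside a slice allocation you might verify the device is visible (the second command assumes a Python environment with PyTorch installed):
    $ nvidia-smi -L – List the GPU (MIG) devices visible to the current allocation.
    $ python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))" – Confirm the Python environment can initialize CUDA and identify the device.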

Additional Help