Slurm - SBATCH Job Scripts

Overview

SBATCH job scripts (Slurm batch job scripts) are the conventional way to do work on the supercomputer. In these scripts, you request a resource allocation and define the work to be done. They are sometimes also referred to as job scripts, submission scripts, submission files, or Slurm scripts.

Other resource managers, such as PBS, use different types of job submission scripts. ASU uses the Slurm resource manager, so all job scripts must be in the SBATCH format. If you are familiar with another resource manager, see the Rosetta Stone of Resource Managers.

An SBATCH script is just a bash script with special SBATCH headers that define the resource allocation of the job. The Slurm resource manager interprets these headers before the script executes on compute hardware as a job.

Submitting and Managing SBATCH Job Scripts

Submitting SBATCH Job Scripts

sbatch <scriptName>

For example, if you have an SBATCH script saved as the file myscript.sh, you can submit it with the following command from any node on the supercomputer:

sbatch myscript.sh

Replace myscript.sh with the name of your sbatch script.

Modifying a Job At Submission Time

You can override the SBATCH header options inside the script at the time of submission by specifying the flags with the sbatch command.

sbatch [optional flags] <scriptName>

Example

You have an SBATCH job script called bigjob.sh that requests 48 CPU cores and the public QOS inside the script. You can submit this same script to the debug partition with 16 CPU cores by passing the partition and core count as flags to sbatch, which overrides the SBATCH headers inside the script.
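A minimal sketch of that submission, assuming the partition is selected with -p and the core count with -c. The submit wrapper is hypothetical and exists only so the snippet prints the command instead of failing on a machine without Slurm; on the supercomputer you would type the sbatch line directly.

```shell
# Hypothetical helper: run sbatch when it exists, otherwise print the command.
submit() {
    if command -v sbatch >/dev/null 2>&1; then
        sbatch "$@"
    else
        echo "sbatch $*"
    fi
}

# Flags on the command line take precedence over the #SBATCH headers:
#   -p debug  use the debug partition
#   -c 16     request 16 CPU cores instead of the 48 set inside bigjob.sh
submit -p debug -c 16 bigjob.sh
```

Any header that is not overridden on the command line, such as the QOS here, keeps the value set inside the script.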

Canceling a Job

You can cancel a pending or running job with its job ID. Find the IDs of all your jobs with the command myjobs, then pass the ID of the job you want to cancel to the command scancel.

Example
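A short sketch of the two steps. The job ID is a hypothetical placeholder, and the guard only keeps the snippet from failing on a machine where the Slurm commands are not installed.

```shell
job_id=11254871   # hypothetical ID; use one listed by myjobs

if command -v scancel >/dev/null 2>&1; then
    myjobs              # site-provided command that lists your jobs and their IDs
    scancel "$job_id"   # cancel that one job, whether it is pending or running
else
    echo "scancel not found: run these commands on the supercomputer" >&2
fi
```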

Updating a Pending Job

If you submitted a job and it is still PENDING, you can make some changes to the requested resources with the scontrol update command.

For example, if you have submitted a job with the job ID 11254871 that requests 1 CPU core but should have requested 4, you can raise the core count with scontrol update while the job is still pending.
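A sketch of that update, assuming the NumCPUs field described in the scontrol man page is the right one for this job. The guard only keeps the snippet from failing where scontrol is not installed.

```shell
job_id=11254871   # job ID from the example above

if command -v scontrol >/dev/null 2>&1; then
    scontrol update JobId="$job_id" NumCPUs=4   # request 4 CPU cores instead of 1
    scontrol show job "$job_id"                 # print the job record to confirm the change
else
    echo "scontrol not found: run these commands on the supercomputer" >&2
fi
```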

More Examples:
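The sketch below shows a few other fields that scontrol update accepts for a pending job. The update_job helper is hypothetical, and the values are illustrations only: the partition and QOS names are the ones mentioned earlier on this page, and whether a given change is permitted depends on site policy.

```shell
job_id=11254871   # job ID from the example above

# Hypothetical helper: apply one Field=value change to the pending job,
# or print the command when scontrol is not installed.
update_job() {
    if command -v scontrol >/dev/null 2>&1; then
        scontrol update JobId="$job_id" "$1"
    else
        echo "scontrol update JobId=$job_id $1"
    fi
}

update_job TimeLimit=0-02:00:00   # new wall time; only administrators can increase it
update_job Partition=debug        # move the job to another partition
update_job QOS=public             # change the QOS
```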

Example SBATCH Scripts

More examples are available under:

/packages/public/sol-sbatch-templates/templates/

/packages/public/phx-sbatch-templates/templates/

Simple Job

Here is a skeleton of a sample serial job that runs a Python script from your home directory.
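A minimal sketch of such a skeleton, written to a file with a heredoc so it can be created in one step. The partition name general, the time limit, and the script name myscript.py are assumptions for illustration; the mamba module and scicomp environment are the ones used in the sample later on this page.

```shell
# Write the job script to simple_job.sh (the quoted 'EOF' prevents any expansion).
cat > simple_job.sh <<'EOF'
#!/bin/bash
#SBATCH -N 1              # one node
#SBATCH -c 1              # one CPU core (serial job)
#SBATCH -t 0-00:10:00     # time limit in d-hh:mm:ss
#SBATCH -p general        # partition (name assumed)
#SBATCH -q public         # QOS
#SBATCH -o slurm.%j.out   # STDOUT file (%j = job ID)
#SBATCH -e slurm.%j.err   # STDERR file (%j = job ID)

# Set up the Python environment, then run the script from the home directory.
module load mamba/latest
source activate scicomp
python ~/myscript.py      # hypothetical script name
EOF
```

The script would then be submitted with sbatch simple_job.sh.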

MPI (Parallel) Job


The main difference between serial and parallel jobs is that parallel jobs allocate a specified number of tasks with the -n flag, in addition to the number of cores per task with the -c flag. If -c is not defined, it defaults to 1, which is fine in most cases. A task is a ‘slot’ within a job that an MPI process can bind to, and a core is a physical CPU core on the supercomputer that does work. The MPI processes of a supported application bind to these tasks, which can sit on a single node or be spread across multiple nodes, to parallelize the work.

Software for MPI is launched with an MPI execution utility, such as srun or mpirun. Which one to use depends on how the software was compiled. See the MPI documentation for more information on executing MPI software.
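A sketch of an MPI job script under stated assumptions: the module name openmpi, the program name my_mpi_app, the partition name general, and the node and task counts are all hypothetical, and srun is used only as one of the two launchers named above.

```shell
# Write the job script to mpi_job.sh (the quoted 'EOF' prevents any expansion).
cat > mpi_job.sh <<'EOF'
#!/bin/bash
#SBATCH -N 2              # spread the tasks over two nodes
#SBATCH -n 8              # eight tasks, one per MPI process
#SBATCH -c 1              # one core per task (the default)
#SBATCH -t 0-00:30:00     # time limit in d-hh:mm:ss
#SBATCH -p general        # partition (name assumed)
#SBATCH -q public         # QOS
#SBATCH -o slurm.%j.out   # STDOUT file (%j = job ID)
#SBATCH -e slurm.%j.err   # STDERR file (%j = job ID)

module load openmpi       # hypothetical module name; load the MPI your software was built with
srun ./my_mpi_app         # hypothetical program; srun starts one process per allocated task
EOF
```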

Job Arrays

A job array is a way to submit and manage a collection of similar jobs simultaneously. Instead of submitting individual job scripts for each task, you can use a job array to submit a set of related tasks as a single entity. This can be particularly useful when you have a large number of similar jobs to run, such as running the same computation with different input parameters. Typically, a manifest file is used to define these input parameters.

For example, say you want to run a Python script 20 times, each time with a different input file. You can generate a manifest file that lists the input files, then reference it in the SBATCH job array script so that each array task runs the script on different input data.
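A sketch of such a job array script, assuming a manifest named manifest.txt with one input path per line and a Python script named myscript.py (both names are hypothetical). Each array task uses its index, which Slurm provides in SLURM_ARRAY_TASK_ID, to pick one line of the manifest.

```shell
# Write the job script to array_job.sh (the quoted 'EOF' keeps $ variables unexpanded).
cat > array_job.sh <<'EOF'
#!/bin/bash
#SBATCH --array=1-20         # 20 array tasks, numbered 1 to 20
#SBATCH -c 1                 # one CPU core per array task
#SBATCH -t 0-00:10:00        # time limit per array task
#SBATCH -q public            # QOS
#SBATCH -o slurm.%A_%a.out   # %A = array job ID, %a = array task index
#SBATCH -e slurm.%A_%a.err

module load mamba/latest
source activate scicomp

# Pick the manifest line whose number matches this task's index.
input_file=$(sed -n "${SLURM_ARRAY_TASK_ID}p" manifest.txt)
python myscript.py "$input_file"
EOF
```

A single sbatch array_job.sh then queues all 20 tasks under one array job ID.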

 

Manifest file
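A manifest for the 20-input example could be generated as sketched below. The file name manifest.txt and the inputs/input_N.txt naming scheme are assumptions for illustration; any list with one input per line works the same way.

```shell
# Hypothetical layout: inputs/input_1.txt ... inputs/input_20.txt.
# One path per line; line N is the input for array task N.
i=1
while [ "$i" -le 20 ]; do
    echo "inputs/input_${i}.txt"
    i=$((i + 1))
done > manifest.txt

# Show the entry that array task 3 would read.
sed -n '3p' manifest.txt
```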

See the job array documentation page for more detailed examples of job arrays.


Sample SBATCH Job Script Submission
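The listing below is a sketch of a sample script consistent with the explanation in this section, laid out so that its line numbers (counted from the #!/bin/bash line) match the ones referenced there. The #SBATCH values, the partition name general, and the matplotlib and pandas plotting code are assumptions; the heredoc wrapper simply writes the script body to myjob.sh.

```shell
cat > myjob.sh <<'JOBSCRIPT'
#!/bin/bash

#SBATCH -N 1            # number of nodes
#SBATCH -c 1            # number of cores
#SBATCH -t 0-01:00:00   # time in d-hh:mm:ss
#SBATCH -p general      # partition (name assumed)
#SBATCH -q public       # QOS
#SBATCH -o slurm.%j.out # file for the job's STDOUT (%j = job ID)
#SBATCH -e slurm.%j.err # file for the job's STDERR (%j = job ID)
#SBATCH --mail-type=ALL # e-mail when the job starts, stops, or fails
#SBATCH --export=NONE   # do not inherit the submitting shell's environment

# Load the modules the job needs
module load mamba/latest
# Activate a Python environment
source activate scicomp

##
# Generate 1,000,000 random numbers in bash,
#   then store the sorted results
#   in Distribution.txt
##
for i in $(seq 1 1000000); do
  printf "%d\n" "$RANDOM"
done | sort -n > Distribution.txt
# Plot a histogram using Python and a heredoc
python << EOF
import matplotlib
matplotlib.use('Agg')  # headless backend, no display needed
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('Distribution.txt', header=None, names=['rand'])
plt.hist(df['rand'], bins=100)
plt.xlabel('Integer Result')
plt.ylabel('Count')
plt.title('Sampled Uniform Distribution')
plt.savefig('Histogram.png')
EOF
JOBSCRIPT
```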

 

You can copy this into a file on the supercomputer, save it as myjob.sh, and submit it with the command sbatch myjob.sh.

Explanation:

After the SBATCH headers, the actual script contents follow (lines 13-38). They are run by the interpreter named on the shebang line, #!/bin/bash. Some of the lines are highlighted here:

  • Line 14: module load mamba/latest

    • This loads a Python environment manager, mamba, which is used to create application-specific environments or activate existing ones. In this case, we’re going to activate an admin-provided environment (Line 16).

  • Line 16: source activate scicomp

    • This activates a base scientific computing Python environment.

  • Lines 23-25

    • This loop generates one million uniformly distributed random integers and saves them to a file.

  • Lines 27-38

    • This section of the code runs a Python script (provided via a heredoc for in-line visibility on this documentation page).

Scheduling the script for execution (sbatch myjob.sh) results in the scheduler assigning a unique integer, a job ID, to the work. The status of the job can be viewed on the command line with myjobs. Once the job starts, its status changes from PENDING to RUNNING, and when the job completes it is no longer visible in the queue. Instead, the filesystem will contain slurm.%j.out and slurm.%j.err (not literally %j, which is a placeholder for the job ID) and the Python-generated plot Histogram.png.

 

 


Troubleshooting

Problems with batch files created under Windows

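If you create an sbatch script on a Windows system and then move it to a Linux-based supercomputer, you may see an error like the following when you try to submit it:

sbatch: error: Batch script contains DOS line breaks (\r\n)
sbatch: error: instead of expected UNIX line breaks (\n).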

This is because of a difference between the way that Windows and Unix systems define the end of a line in a text file.

If you see this error, the dos2unix utility can convert the script to the proper format.
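The sketch below shows one way to detect and fix the problem. It first creates a small demonstration file, winscript.sh (a hypothetical name), with Windows line endings, then converts it with dos2unix, falling back to tr if dos2unix is not installed on the machine.

```shell
# Create a small demo script with Windows (CRLF) line endings.
printf '#!/bin/bash\r\n#SBATCH -c 1\r\necho hello\r\n' > winscript.sh

# A carriage return at the end of a line marks a DOS-format file.
cr=$(printf '\r')
if grep -q "${cr}\$" winscript.sh; then
    if command -v dos2unix >/dev/null 2>&1; then
        dos2unix winscript.sh    # converts the file in place
    else
        # Fallback: delete every carriage return in the file.
        tr -d '\r' < winscript.sh > winscript.tmp && mv winscript.tmp winscript.sh
    fi
fi
```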

This Does Not Look Like a Bash Script

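If you see an error like:

sbatch: error: This does not look like a batch script.  The first
sbatch: error: line must start with #! followed by the path to an interpreter.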

This is because your SBATCH script is missing the shebang line.

The very first line in any SBATCH script should be:

#!/bin/bash
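As a quick check before submitting, the first line of a script can be inspected directly. This sketch uses a throwaway file, no_shebang.sh (a hypothetical name), and writes a corrected copy to fixed.sh instead of editing the original.

```shell
# Demo file that lacks a shebang line.
printf '#SBATCH -c 1\necho hello\n' > no_shebang.sh

first_line=$(head -n 1 no_shebang.sh)
case "$first_line" in
    '#!'*)
        echo "shebang present: $first_line"
        ;;
    *)
        # Prepend the interpreter line, keeping the rest of the file intact.
        { echo '#!/bin/bash'; cat no_shebang.sh; } > fixed.sh
        echo "added shebang, wrote fixed.sh"
        ;;
esac
```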

 

Invalid Feature Specification

This error is caused by an invalid partition, QoS, or constraint. Check that your script uses the proper QoS and partition for the supercomputer and that any constraints are valid.

Slurm Exit Codes

For reference, a guide for exit codes:

  • 0 → success

  • non-zero → failure

  • Exit code 1 indicates a general failure

  • Exit code 2 indicates incorrect use of shell builtins

  • Exit codes 3-124 indicate an error in the job (check the software's exit codes)

  • Exit code 125 indicates out of memory

  • Exit code 126 indicates command cannot execute

  • Exit code 127 indicates command not found

  • Exit code 128 indicates invalid argument to exit

  • Exit codes 129-192 indicate jobs terminated by Linux signals

    • For these, subtract 128 from the number and match to signal code

    • Enter kill -l to list signal codes

    • Enter man signal for more information
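The subtraction rule for signal-related exit codes can be sketched as a small shell function. The function name describe_exit_code is hypothetical, and the sketch only distinguishes the three broad cases from the list above rather than every individual code.

```shell
# Translate a job's exit code into a short description.
describe_exit_code() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "success"
    elif [ "$code" -ge 129 ] && [ "$code" -le 192 ]; then
        # Subtract 128 to get the signal number, then look up its name.
        echo "terminated by SIG$(kill -l $((code - 128)))"
    else
        echo "failure with exit code $code"
    fi
}

describe_exit_code 137   # 137 - 128 = 9, the KILL signal
describe_exit_code 143   # 143 - 128 = 15, the TERM signal
```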


Additional Help
