Slurm - SBATCH Job Scripts
Overview
SBATCH job scripts (Slurm batch job scripts) are the conventional way to do work on the supercomputer. In these scripts, you request a resource allocation and define the work to be done. They are sometimes also referred to as job scripts, submission scripts, submission files, or Slurm scripts.
Other resource managers, such as PBS, use different types of job submission scripts. ASU uses the Slurm resource manager, so all job scripts must be in the SBATCH format. If you are familiar with another resource manager, see the Rosetta Stone of Resource Managers.
An SBATCH script is a bash script with special #SBATCH header lines that define the job's resource allocation. The Slurm resource manager interprets these headers before the script executes on compute hardware as a job.
Submitting and Managing SBATCH Job Scripts
Submitting SBATCH Job Scripts
sbatch <scriptName>
For example, if you have saved an SBATCH script as the file myscript.sh, you can submit it with the following command from any node on the supercomputer:
sbatch myscript.sh
Replace myscript.sh with the name of your sbatch script.
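When a later step needs the job ID (for example, to cancel or update the job), the submission can be scripted. The following is a minimal sketch, assuming sbatch's --parsable flag, which prints the job ID optionally followed by a semicolon and the cluster name. The helper name job_id_from_parsable is hypothetical and used only for illustration.

```shell
# Strip the optional ";cluster" suffix that `sbatch --parsable` may append.
job_id_from_parsable() {
    echo "${1%%;*}"
}

script="myscript.sh"   # replace with the name of your sbatch script

# Only attempt a submission where sbatch is available and the script exists.
if command -v sbatch >/dev/null 2>&1 && [ -f "$script" ]; then
    raw=$(sbatch --parsable "$script")
    echo "Submitted job $(job_id_from_parsable "$raw")"
else
    echo "sbatch or $script not found; nothing submitted"
fi
```

The guard lets the snippet run harmlessly on a machine without Slurm; on the supercomputer it submits the script and prints the numeric job ID.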
Modifying a Job At Submission Time
You can override the SBATCH header options inside the script at the time of submission by specifying the flags with the sbatch command.
sbatch [optional flags] <scriptName>
Example
You have an SBATCH job script called bigjob.sh that requests 48 CPU cores and the public QOS inside the script. You can submit this same script to the debug partition with 16 CPU cores, overriding the SBATCH headers inside the script, with the following command:
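A sketch of such an override, assuming the script requests its cores with -c (--cpus-per-task) and that debug is selected with -p (--partition) as described. If your system exposes debug as a QOS rather than a partition, -q debug would be the flag to use instead.

```shell
# Flags given on the command line take precedence over #SBATCH headers in the script.
override_flags="-p debug -c 16"   # assumption: debug is a partition name on this system

if command -v sbatch >/dev/null 2>&1 && [ -f bigjob.sh ]; then
    # Unquoted on purpose so the flags split into separate arguments.
    sbatch $override_flags bigjob.sh
else
    echo "Would run: sbatch $override_flags bigjob.sh"
fi
```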
Canceling a Job
You can cancel a pending or running job with its job ID. Use the command myjobs to find the IDs of all your running jobs, then use the command scancel to cancel the job.
Example
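A sketch of the cancel workflow. myjobs is the site command named above; squeue -u is the standard Slurm way to list a user's jobs and is used here so the snippet does not depend on the site wrapper. The job ID is a placeholder.

```shell
job_id=11254871   # placeholder; use an ID from your own job list

if command -v scancel >/dev/null 2>&1; then
    squeue -u "$(id -un)"    # list your pending and running jobs with their IDs
    scancel "$job_id"        # cancel the chosen job
else
    echo "Would run: scancel $job_id"
fi
```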
Updating a pending job
If you submitted a job and it is still PENDING, you can change some of the requested resources with the scontrol update command.
For example, if you submitted a job with the job ID 11254871 that requests 1 CPU core, but you realize it should have 4 CPU cores, you can run the following:
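A sketch using scontrol update. Which field to change depends on how the job requested its cores: NumCPUs sets the job's CPU count, while CPUsPerTask would be the field for a job that used -c. Treat the field choice here as an assumption and confirm the result with scontrol show job.

```shell
job_id=11254871   # the job ID from the example above

if command -v scontrol >/dev/null 2>&1; then
    scontrol update JobId="$job_id" NumCPUs=4   # change the pending job's CPU request
    scontrol show job "$job_id"                 # confirm the new request
else
    echo "Would run: scontrol update JobId=$job_id NumCPUs=4"
fi
```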
More Examples:
Example SBATCH Scripts
More examples are available under:
/packages/public/sol-sbatch-templates/templates/
/packages/public/phx-sbatch-templates/templates/
Running sleep commands in sbatch jobs is against our policy and the job will be purged.
Simple Job
Here is a skeleton of a sample serial job that runs a Python script from your home directory:
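A minimal sketch of such a script. The file name simple_job.sh and the Python script name myscript.py are hypothetical, and the partition, QOS, and any software environment setup are left out because they are site-specific. The snippet writes the job script to a file and checks its shell syntax without running it.

```shell
# Write the job script to a file (the quoted 'EOF' keeps the contents literal).
cat > simple_job.sh <<'EOF'
#!/bin/bash
#SBATCH -c 1                 # number of CPU cores
#SBATCH -t 0-00:10:00        # time limit as days-hours:minutes:seconds
#SBATCH -o slurm.%j.out      # file for standard output; %j is the job ID
#SBATCH -e slurm.%j.err      # file for standard error

# Hypothetical script name; point this at your own Python file.
python3 "$HOME/myscript.py"
EOF

# Parse the script without executing it to catch shell syntax errors early.
bash -n simple_job.sh && echo "simple_job.sh: syntax OK"
```

Once the headers match your needs, the file can be submitted with sbatch simple_job.sh.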
MPI (Parallel) Job
See Running MPI Software on Sol or Running MPI Software on Phx.
The main difference between serial and parallel jobs is that parallel jobs allocate a specified number of tasks using the -n flag, in addition to the number of cores per task using the -c flag. If -c is not defined, it defaults to 1, which is fine in most cases. A task is a 'slot' within a job that an MPI process can bind to, and a core is a physical CPU core on the supercomputer that does work. MPI processes bind to tasks spread across one or more nodes to parallelize the work of a supported application.
MPI software is launched with an MPI execution utility, such as srun or mpirun. Which one to use depends on how the software was compiled.
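A minimal sketch of a parallel job script using the -n and -c flags described above. The file name mpi_job.sh and the executable name my_mpi_app are hypothetical, and srun is assumed as the launcher; software built to expect mpirun would use that instead.

```shell
# Write the MPI job script to a file (the quoted 'EOF' keeps the contents literal).
cat > mpi_job.sh <<'EOF'
#!/bin/bash
#SBATCH -n 8                 # number of tasks (one MPI process per task)
#SBATCH -c 1                 # cores per task; 1 is also the default
#SBATCH -t 0-00:30:00        # time limit as days-hours:minutes:seconds
#SBATCH -o slurm.%j.out      # file for standard output; %j is the job ID

# Hypothetical executable name. srun starts one process per allocated task.
srun ./my_mpi_app
EOF

# Parse the script without executing it to catch shell syntax errors early.
bash -n mpi_job.sh && echo "mpi_job.sh: syntax OK"
```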
Job Arrays
A job array is a way to submit and manage a collection of similar jobs simultaneously. Instead of submitting individual job scripts for each task, you can use a job array to submit a set of related tasks as a single entity. This can be particularly useful when you have a large number of similar jobs to run, such as running the same computation with different input parameters. Typically, a manifest file is used to define these input parameters.
For example, say you want to run a Python script 20 times, each time with a different input file. You can generate a manifest file that lists the input files, then reference it in the SBATCH job array script so that each array task runs the script on a different input.
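The manifest approach can be sketched as follows. Slurm sets SLURM_ARRAY_TASK_ID for each array task, and the script uses it to pick the matching line of the manifest. The names manifest.txt, input_N.dat, and process.py are hypothetical; the placeholder manifest and the default task ID of 1 are there only so the sketch can be tried outside Slurm.

```shell
#!/bin/bash
#SBATCH --array=1-20         # one array task per manifest line
#SBATCH -c 1                 # cores per array task
#SBATCH -t 0-00:10:00        # time limit as days-hours:minutes:seconds
#SBATCH -o slurm.%A_%a.out   # %A is the array job ID, %a is the array task index

manifest="manifest.txt"      # hypothetical file: one input path per line

# Build a placeholder manifest if none exists, so the sketch can be tried outside Slurm.
if [ ! -f "$manifest" ]; then
    for i in $(seq 1 20); do
        echo "input_${i}.dat"
    done > "$manifest"
fi

# Slurm sets SLURM_ARRAY_TASK_ID for each array task; default to 1 outside Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-1}"

# Pick the manifest line whose number matches the task index.
input_file=$(sed -n "${task_id}p" "$manifest")
echo "Task ${task_id} would process ${input_file}"
# python3 process.py "$input_file"   # hypothetical script name
```

Keeping the array range equal to the number of manifest lines ensures every input is processed exactly once.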
See the Job Array Examples page for more detailed job array examples.
Sample SBATCH Job Script Submission
Troubleshooting
Problems with batch files created under Windows
If you create an sbatch script on a Windows system and then move it to a Linux-based supercomputer, you may see the following error when you try to run it:
This is because of a difference between the way that Windows and Unix systems define the end of a line in a text file.
If you see this error, the dos2unix utility can convert the script to the proper format.
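A sketch of detecting and fixing Windows line endings. The demo file crlf_demo.sh is hypothetical and is created here only to simulate a script edited on Windows; with a real script, set the variable to its name and skip the printf line. The tr fallback is an assumption for systems where dos2unix is not installed.

```shell
script="crlf_demo.sh"   # demo file; use your own script name instead
printf '#!/bin/bash\r\necho hello\r\n' > "$script"   # simulate a Windows-edited script

cr=$(printf '\r')       # a literal carriage return character

# Convert only when carriage returns are actually present.
if grep -q "$cr" "$script"; then
    if command -v dos2unix >/dev/null 2>&1; then
        dos2unix "$script"
    else
        # Fallback when dos2unix is not installed: remove all carriage returns.
        tr -d '\r' < "$script" > "$script.tmp" && mv "$script.tmp" "$script"
    fi
fi
```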
This Does Not Look Like a Bash Script
If you see the error:
This is because your SBATCH script is missing the shebang line.
The very first line in any SBATCH script should be:
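Because an SBATCH script is a bash script, the interpreter line is typically #!/bin/bash, placed before any #SBATCH headers. The sketch below checks a file for it; the helper name has_shebang and the demo file name are hypothetical.

```shell
# Succeeds when the first line of the given file starts with "#!".
has_shebang() {
    head -n 1 "$1" | grep -q '^#!'
}

# Demo file with the interpreter line first, followed by an SBATCH header.
printf '#!/bin/bash\n#SBATCH -c 1\necho hello\n' > shebang_demo.sh

if has_shebang shebang_demo.sh; then
    echo "shebang_demo.sh starts with an interpreter line"
else
    echo "shebang_demo.sh is missing its interpreter line"
fi
```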
Invalid Feature Specification
This error is caused by an invalid partition, QoS, or constraint. Check that your script uses the proper QoS and partition for the supercomputer and that any constraints are valid. See Partitions and QoS.
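The valid values can also be listed from the command line. This is a sketch using standard Slurm query commands; the function name show_partitions_and_qos is hypothetical, and the output depends on how the cluster is configured.

```shell
# Print the partitions, node features, and QoS names known to the cluster.
show_partitions_and_qos() {
    sinfo --summarize                # partitions and a summary of their node states
    sinfo -o "%P %f"                 # partitions with the node features usable as constraints
    sacctmgr show qos format=Name    # QoS names defined on the cluster
}

if command -v sinfo >/dev/null 2>&1; then
    show_partitions_and_qos
else
    echo "Slurm commands not found; run this on the supercomputer"
fi
```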
Slurm Exit Codes
For reference, a guide for exit codes:
0 → success
non-zero → failure
Exit code 1 indicates a general failure
Exit code 2 indicates incorrect use of shell builtins
Exit codes 3-124 indicate some error in job (check software exit codes)
Exit code 125 indicates out of memory
Exit code 126 indicates command cannot execute
Exit code 127 indicates command not found
Exit code 128 indicates invalid argument to exit
Exit codes 129-192 indicate jobs terminated by Linux signals
For these, subtract 128 from the number and match the result to a signal code
Enter kill -l to list signal codes
Enter man signal for more information
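The subtraction rule for signal exit codes can be sketched as a small helper. The function name signal_for_exit_code is hypothetical; it relies on kill -l accepting a signal number and printing the signal name.

```shell
# Map an exit code in the 129-192 range to the name of the terminating signal.
signal_for_exit_code() {
    code="$1"
    if [ "$code" -gt 128 ] && [ "$code" -le 192 ]; then
        # Subtract 128 to get the signal number, then ask kill for its name.
        kill -l "$((code - 128))" 2>/dev/null || echo "unknown"
    else
        echo "not a signal exit code"
    fi
}

signal_for_exit_code 137   # 137 - 128 = 9, the number of SIGKILL
signal_for_exit_code 143   # 143 - 128 = 15, the number of SIGTERM
```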
Additional Help