Job Array Examples

Here are some real-world use cases for Slurm job arrays.

Case one: Bioinformatics Essential - Bulk BLAST Query

In bioinformatics research, BLAST is a tool for searching input DNA or RNA sequence files (.fasta files in this example, which are formatted text files) against genomic databases. This example runs blastx on a folder containing 100 fasta files and searches each file against four existing databases, so the job array consists of 100*4=400 sub-jobs in total. The example covers all the steps needed, but assumes that the mamba env, the BLAST software suite, and the BLAST databases are already set up correctly.

Design the workflow

We will use a manifest file to supply the inputs: the location of each fasta file and the name/location of each database. We then need an sbatch script that calls blastx and generates the job array, and a single command to submit this sbatch script.
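
As a rough sketch of how these pieces fit together, each array task can use its index (Slurm sets SLURM_ARRAY_TASK_ID for every task in a job array) to pick one line of the manifest and turn it into one blastx call. The snippet below is a dry run under stated assumptions, not the script used in this example: it writes a two-line stand-in manifest to a temporary directory, defaults the task ID to 1 so it can be tried outside Slurm, and only prints the blastx command instead of running it. The output file naming is hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: map one array task to one manifest line.
# The manifest contents and the output naming are assumptions for illustration.

workdir=$(mktemp -d)
manifest="$workdir/manifest"

# Stand-in manifest: column 1 = database, column 2 = fasta file.
printf '%s\n' \
  "db1 /scratch/spock/dataset/sample1.fasta" \
  "db2 /scratch/spock/dataset/sample1.fasta" > "$manifest"

# Slurm sets SLURM_ARRAY_TASK_ID inside an array task; default to 1 elsewhere.
task_id=${SLURM_ARRAY_TASK_ID:-1}

# Pick the line whose number equals the task ID.
line=$(sed -n "${task_id}p" "$manifest")

# Split the line at the first space into database and query file.
db=${line%% *}
query=${line#* }

# Hypothetical output name: <fasta basename>.<db>.out
out="$(basename "$query" .fasta).${db}.out"

# Print the command instead of running it, since this is only a dry run.
cmd="blastx -query $query -db $db -out $out"
echo "$cmd"
```

In a real run, the array range would cover every line of the manifest (for example `--array=1-400` for the 400 lines in this example), and the echo would be replaced by the command itself.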

Generate the manifest file

parallel -k echo {} ::: db1 db2 db3 db4 ::: /dir/to/all/the/*.fasta > manifest

There are other ways to generate a manifest file with one or more columns; using parallel is one of the easiest. This command pairs each of the attributes {db1, db2, db3, db4} with every fasta file in the given directory, then writes the results line by line into a text file called manifest. Below is an example of the output. The -k flag keeps the output lines in the same order as the given arguments.

# parallel -k echo {} ::: db1 db2 db3 db4 ::: /scratch/spock/dataset/*.fasta > manifest

db1 /scratch/spock/dataset/sample1.fasta
db2 /scratch/spock/dataset/sample1.fasta
db3 /scratch/spock/dataset/sample1.fasta
db4 /scratch/spock/dataset/sample1.fasta
db1 /scratch/spock/dataset/sample2.fasta
db2 /scratch/spock/dataset/sample2.fasta
db3 /scratch/spock/dataset/sample2.fasta
db4 /scratch/spock/dataset/sample2.fasta
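
As one of the other ways mentioned above, a nested shell loop can build the same kind of two-column manifest without parallel. The sketch below is self-contained for illustration only: it creates two empty stand-in fasta files in a temporary directory (hypothetical names), loops over the files and then over the databases so that the pairs come out in the order of the listing above, and counts the lines as a sanity check.

```shell
#!/bin/sh
# Sketch: build a manifest with nested loops instead of parallel.
# The temporary directory and empty stand-in files are for illustration only.

datadir=$(mktemp -d)
touch "$datadir/sample1.fasta" "$datadir/sample2.fasta"

manifest="$datadir/manifest"
: > "$manifest"   # start from an empty manifest

# Outer loop over fasta files, inner loop over databases.
for fasta in "$datadir"/*.fasta; do
  for db in db1 db2 db3 db4; do
    echo "$db $fasta" >> "$manifest"
  done
done

# Number of lines = number of files x number of databases (2 x 4 here).
n_lines=$(wc -l < "$manifest" | tr -d ' ')
echo "manifest has $n_lines lines"
cat "$manifest"
```

With 100 fasta files and four databases, the same check should report 400 lines, one per sub-job in the array.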

Create the sbatch script
Run Simple Benchmarking Jobs to Help Estimate Wall Time