Here are some real-world use cases for Slurm job arrays.
Case one: Bioinformatics Essential - Bulk BLAST Query
In bioinformatics research, BLAST is a tool for searching input DNA or RNA sequence files (.fasta files in this example, which are formatted text files) against genomic databases. This example runs blastx on a folder containing 100 fasta files and searches each file against four existing databases, so the job array will launch a total of 100 × 4 = 400 sub-jobs. The example covers all the steps needed, but assumes the mamba environment, the BLAST software suite, and the BLAST databases are already set up correctly.
Design the workflow
We will use a manifest file to feed the inputs: the location of each fasta file and the name/location of each database. We then need an sbatch script that calls blastx and generates the job array, and a single command to submit this sbatch script.
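The core of this design is that each array task reads exactly one line of the manifest. A minimal sketch of that lookup follows, assuming the task index is used as the manifest line number. Slurm sets SLURM_ARRAY_TASK_ID inside every array task; here it falls back to 3 so the snippet also runs outside Slurm. The temporary manifest and the /path/to/... entries are hypothetical placeholders, not the files used in this example.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical two-column manifest: database name, then fasta path.
demo_manifest=$(mktemp)
printf '%s\n' \
  'db1 /path/to/sample1.fasta' \
  'db1 /path/to/sample2.fasta' \
  'db2 /path/to/sample1.fasta' \
  'db2 /path/to/sample2.fasta' > "$demo_manifest"

# Inside a real job array Slurm sets SLURM_ARRAY_TASK_ID.
# Default to 3 here so the sketch also runs on a login node.
task_id=${SLURM_ARRAY_TASK_ID:-3}

# Pick the manifest line whose number equals the task index.
line=$(sed -n "${task_id}p" "$demo_manifest")

# Split the line into its two columns at the first space.
db=${line%% *}
query=${line#* }

echo "task ${task_id}: database=${db} query=${query}"

# The number of manifest lines is the upper bound of the array range.
n_tasks=$(( $(wc -l < "$demo_manifest") ))
echo "array range: 1-${n_tasks}"
```

With the full 400-line manifest, the same range would be passed at submission time through the --array option of sbatch, for example --array=1-400.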
Generate the manifest file
Code Block |
---|
parallel -k echo {} ::: db1 db2 db3 db4 ::: /dir/to/all/the/*.fasta > manifest |
There are other ways to generate a manifest file with one or more columns; using parallel is one of the easiest. This command pairs each of the names db1, db2, db3, and db4 with every fasta file in the given directory, then writes the combinations line by line into a text file called manifest. The -k flag keeps the output lines in the same order as the input arguments. Below is an example of the output.
Code Block |
---|
# parallel -k echo {} ::: db1 db2 db3 db4 ::: /scratch/spock/dataset/*.fasta > manifest
db1 /scratch/spock/dataset/sample1.fasta
db1 /scratch/spock/dataset/sample2.fasta
db2 /scratch/spock/dataset/sample1.fasta
db2 /scratch/spock/dataset/sample2.fasta
db3 /scratch/spock/dataset/sample1.fasta
db3 /scratch/spock/dataset/sample2.fasta
db4 /scratch/spock/dataset/sample1.fasta
db4 /scratch/spock/dataset/sample2.fasta |
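If parallel is not available, one of the other ways mentioned above is a pair of nested shell loops. The sketch below is only an illustration under stated assumptions: it creates a hypothetical temporary directory with two empty fasta files so that it can run anywhere, and writes a demo manifest there rather than touching real data. Replace the directory with the real fasta location to use it.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical scratch directory with two empty fasta files (demo only).
demo_dir=$(mktemp -d)
touch "$demo_dir/sample1.fasta" "$demo_dir/sample2.fasta"

demo_manifest="$demo_dir/manifest"

# Outer loop: database names. Inner loop: fasta files.
# Each combination becomes one manifest line: "<db> <fasta path>".
for db in db1 db2 db3 db4; do
  for fasta in "$demo_dir"/*.fasta; do
    printf '%s %s\n' "$db" "$fasta"
  done
done > "$demo_manifest"

cat "$demo_manifest"

# Sanity check: the line count should equal databases x files.
n_lines=$(( $(wc -l < "$demo_manifest") ))
echo "manifest lines: ${n_lines}"
```

Checking the line count before submission is a cheap way to confirm that the array range matches the manifest; for the 100 fasta files and four databases in this example it should be 400.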
Create the sbatch script
Run Simple Benchmarking Jobs to Help Estimate Wall Time