Here are some real use cases of Slurm job arrays.

Case One: Bioinformatics Essential - Bulk BLAST Query

In bioinformatics research, BLAST is a tool for searching input DNA or RNA sequence files (.fasta files in this example, which are formatted text files) against sequence databases. This example runs blastx on a folder containing 100 fasta files and searches each file against four existing databases, so the job array will consist of 100*4=400 sub-jobs. The example covers all the steps needed, but assumes that the mamba env, the BLAST software suite, and the BLAST databases are already set up correctly.

Check the /data/datasets/community/ directory for BLAST databases that have already been downloaded.

  1. Design the workflow

We will use a manifest file to feed the inputs: the location of each fasta file and the name/location of each database. We then need an sbatch script that calls blastx for one sub-job, and a single command that submits this sbatch script as a job array.

  2. Generate the manifest file

There are several ways to generate a manifest file with one or more columns; GNU parallel is one of the easiest. Parallel is available as a software module on the supercomputers and can also be installed with mamba in a Python env. This step can equally be performed on your local computer if parallel is installed there. To find and load the parallel module on the supercomputers, and then generate the manifest:

module avail parallel  # to find the correct module name
module load parallel-20220522-gcc-12.1.0  # for Sol
module load parallel-20220522-ie          # for Phoenix
parallel -k echo {} ::: db1 db2 db3 db4 ::: /dir/to/all/the/*.fasta > manifest

This command combines each of the attributes {db1, db2, db3, db4} with every fasta file in the given directory and writes the combinations, one per line, into a text file called manifest. The -k flag keeps the output lines in the same order as the inputs were given; with several ::: input sources, the last source varies fastest. Below is an example of the output for a directory containing two fasta files.

# parallel -k echo {} ::: db1 db2 db3 db4 ::: /scratch/spock/dataset/*.fasta > manifest

db1 /scratch/spock/dataset/sample1.fasta
db1 /scratch/spock/dataset/sample2.fasta
db2 /scratch/spock/dataset/sample1.fasta
db2 /scratch/spock/dataset/sample2.fasta
db3 /scratch/spock/dataset/sample1.fasta
db3 /scratch/spock/dataset/sample2.fasta
db4 /scratch/spock/dataset/sample1.fasta
db4 /scratch/spock/dataset/sample2.fasta
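If GNU parallel is not available, a manifest of the same shape can be produced with a few lines of Python. This is only a sketch: `build_manifest` is a hypothetical helper name, and the database and file names are placeholders. `itertools.product` varies its last argument fastest, which is also how GNU parallel orders the combinations of several `:::` sources.

```python
from itertools import product


def build_manifest(*attributes):
    """Return one space-separated line per combination of the attributes."""
    return [" ".join(combo) for combo in product(*attributes)]


# Placeholder inputs: two databases and two fasta files.
databases = ["db1", "db2"]
fasta_files = ["sample1.fasta", "sample2.fasta"]

# Print the manifest lines; redirect the output to a file to save it.
for line in build_manifest(databases, fasta_files):
    print(line)
```

Any number of attribute lists can be passed, so the same helper also covers manifests with three or more columns.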
  3. Create the sbatch script

#!/bin/bash
#SBATCH -c 8            # number of "cores"
#SBATCH -t 0-04:00:00  # time in d-hh:mm:ss
#SBATCH -p serial       # partition
#SBATCH -q normal       # QOS
#SBATCH -e slurm.%A_%a.err # file to save STDERR for each sub-job
#SBATCH --export=NONE   # Purge the job-submitting shell environment
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=spock@asu.edu

# process the manifest file: sub-job N reads line N
# getline is assumed to be a site-provided helper that prints line N of a file;
# if it is not available, sed -n "${taskid}p" "$manifest" is equivalent
manifest="${1:?ERROR -- must pass a manifest file}"
taskid=$SLURM_ARRAY_TASK_ID
db=$(getline "$taskid" "$manifest" | cut -f 1 -d ' ')
fasta=$(getline "$taskid" "$manifest" | cut -f 2 -d ' ')

# set up the software environment
module purge
module load anaconda/py3
source activate bat

# set up input and output file names
base=$(basename -s .fasta "$fasta")
out1=/scratch/name/"$base"_${db}_fmt11.txt
out2=/scratch/name/"$base"_${db}_fmt6.txt

# collect all the blastx flags in an array to keep the call readable
args=(
 -query "$fasta"
 -db "/scratch/name/path/db/$db/$db"
 -evalue 1e-3
 -num_threads "${SLURM_CPUS_PER_TASK:-1}"  # follows the -c value
 -max_target_seqs 5
 -max_hsps 1
 -outfmt 11
 -out "$out1"
)
blastx "${args[@]}"

# convert the fmt11 archive into a readable tabular (fmt6) file
blast_formatter -archive "$out1" -outfmt "6 qseqid sseqid evalue" -out "$out2"
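The manifest lookup in the script above (print line N, then cut out one field) can be sketched in Python to make the mapping from sub-job index to inputs explicit. `read_task_fields` is a hypothetical name, and the sketch assumes that `getline` simply prints the N-th line of the file, which the page does not spell out.

```python
import os
import tempfile


def read_task_fields(manifest_path, task_id):
    """Return the space-separated fields on line task_id (1-based) of the manifest."""
    with open(manifest_path) as handle:
        for line_number, line in enumerate(handle, start=1):
            if line_number == task_id:
                return line.split()
    raise IndexError(f"manifest has no line {task_id}")


# Demo with a throwaway two-line manifest.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "manifest")
    with open(path, "w") as handle:
        handle.write("db1 /scratch/spock/dataset/sample1.fasta\n")
        handle.write("db1 /scratch/spock/dataset/sample2.fasta\n")
    # Inside a real sub-job the index would come from SLURM_ARRAY_TASK_ID.
    db, fasta = read_task_fields(path, 2)
    print(db, fasta)
```

Raising an error for a missing line makes a sub-job fail loudly when the array range is larger than the manifest, instead of running with empty inputs.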
  4. Benchmarking

This step is not part of setting up the job array, but it is very important for estimating a suitable wall time and number of cores. Although many sbatch parameters are set in the sbatch script above, they can be overridden directly on the command line. In the commands below, -a selects which sub-job indices of the job array to run, and -c overrides the number of cores requested in the script.

All three commands below use the first sub-job for benchmarking, that is, the first line of our manifest file, which queries /scratch/spock/dataset/sample1.fasta against db1.

sbatch -a 1 -c 1 blastx_fasta_array.sh manifest
sbatch -a 1 -c 4 blastx_fasta_array.sh manifest
sbatch -a 1 -c 8 blastx_fasta_array.sh manifest

More core counts can be tested in the same way. After all the tests have completed, we can decide on a suitable number of cores for each sub-job (the -c value) and on the wall time to request (the -t value). In a job array the time limit applies to each sub-job separately, so it must be longer than the longest sub-job is expected to run.

BLAST does not scale linearly, so adding more cores does not necessarily shorten the run time.
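One way to compare the benchmark runs is to turn their elapsed times into speedup and parallel efficiency. The sketch below assumes you have collected one elapsed time per core count yourself (for example from the Elapsed field of sacct). `scaling_table` is a hypothetical helper, and the timings are placeholders, not measurements.

```python
def scaling_table(wall_times):
    """Map core count -> (speedup, efficiency) relative to the smallest core count."""
    base_cores = min(wall_times)
    base_time = wall_times[base_cores]
    table = {}
    for cores, seconds in sorted(wall_times.items()):
        speedup = base_time / seconds
        # Efficiency of 1.0 would mean perfectly linear scaling.
        efficiency = speedup / (cores / base_cores)
        table[cores] = (speedup, efficiency)
    return table


# Placeholder timings in seconds, NOT real measurements; replace them with
# the elapsed times of your own benchmark sub-jobs.
example_times = {1: 1000.0, 4: 400.0, 8: 320.0}
for cores, (speedup, efficiency) in scaling_table(example_times).items():
    print(f"{cores} cores: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

A core count whose efficiency drops sharply is usually a poor choice for the full array, because the extra cores would do more good running additional sub-jobs.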

  5. Run the entire job array

First find out how many lines there are in the manifest file (for example with wc -l manifest); this is the total number of sub-jobs. In this example there are 8 sub-jobs, so the command to submit the job array is:

sbatch -a 1-8 blastx_fasta_array.sh manifest

The sbatch script runs from the directory it was submitted from, so the manifest file should be in that directory. After the run, two output files will have been generated for each sub-job, that is, for each combination of fasta file and database: one is the fmt11 archive file, the other is the readable fmt6 file. The paths used in the scripts must be changed carefully to reflect your actual directory structure.
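The array range can also be derived from the manifest instead of being counted by hand. The following sketch builds the submit command as a string; `array_submit_command` is a hypothetical helper, and it assumes the manifest has no blank lines between entries, since those would shift the line numbers that the sub-jobs read.

```python
import shlex


def array_submit_command(manifest_lines, script="blastx_fasta_array.sh", manifest="manifest"):
    """Build the sbatch command that runs one sub-job per non-empty manifest line."""
    n_tasks = sum(1 for line in manifest_lines if line.strip())
    if n_tasks == 0:
        raise ValueError("manifest is empty")
    return shlex.join(["sbatch", "-a", f"1-{n_tasks}", script, manifest])


# Demo with in-memory lines; a trailing blank line is ignored.
lines = ["db1 sample1.fasta\n", "db1 sample2.fasta\n", "\n"]
print(array_submit_command(lines))
```

With a real manifest, the lines would come from `open("manifest")`, and the resulting string can be checked by eye before it is run.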


Case Two: From Loops to Parallel

Running many loop iterations serially can be time consuming. Since each iteration is similar and does not depend on the others, a better solution is to run each iteration as a single sub-job in a job array.

  1. Design the workflow

For example, in machine learning, a single split into training and testing sets is often not representative enough to avoid bias. Even a single round of k-fold cross validation is sometimes not enough. A solution is to loop over a range of integers, use each integer as the seed of the random state, and run multiple splits with different random seeds. We can then simply send out each loop iteration as a sub-job, using the sub-job ID as the random seed.
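The idea can be sketched with the standard library alone. `split_indices` is a hypothetical stand-in for whatever splitting routine the real training code uses, and the sample count and test fraction are arbitrary illustration values.

```python
import random


def split_indices(n_samples, seed, test_fraction=0.2):
    """Shuffle range(n_samples) with the given seed and split it into train/test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


# Serial version: one loop iteration per seed.
for seed in range(1, 4):
    train, test = split_indices(100, seed)
    print(seed, len(train), len(test))

# Job-array version: each sub-job would run the loop body exactly once, e.g.
#   seed = int(os.environ["SLURM_ARRAY_TASK_ID"])
#   train, test = split_indices(100, seed)
```

Because the split depends only on the seed, a sub-job that fails can be resubmitted on its own and will reproduce the same split.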

A job array can do even more. If multiple pre-processing methods and different machine learning models need to be compared against each other on the same input data, we can set up those choices as different attributes in our manifest file and generate a sub-job for each combination.

This is a more complicated example: we need a manifest file, an sbatch script, and a Python script that runs the actual machine learning training process.

  2. Generate the manifest file

parallel -k echo {} ::: $(seq 1 10) ::: none smote adasyn ::: RF XGB > manifest

The first attribute, $(seq 1 10), expands to the integers from 1 to 10. These are our random seeds, which means each combination of pre-processing method and ML model will be repeated independently 10 times.

The second attribute lists the three choices for pre-processing the dataset: doing nothing, the SMOTE algorithm for balancing labels, and the ADASYN algorithm for balancing labels.

The third attribute lists the two choices of ML algorithm: “RF” for Random Forest and “XGB” for XGBoost.

In total we have 10*3*2=60 lines in the manifest file, or 60 sub-jobs in this job array.

  3. Create the sbatch script

#!/bin/bash

#SBATCH -N 1            # number of nodes
#SBATCH -c 1            # number of cores
#SBATCH -t 0-04:00:00   # time in d-hh:mm:ss
#SBATCH -p general      # partition
#SBATCH -q public       # QOS
#SBATCH -e slurm.%A_%a.err # file to save job's STDERR
#SBATCH --export=NONE   # Purge the job-submitting shell environment
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=spock@asu.edu

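# process the manifest file: sub-job N reads line N
# getline is assumed to be a site-provided helper that prints line N of a file
manifest="${1:?ERROR -- must pass a manifest file}"
taskid=$SLURM_ARRAY_TASK_ID
loopidx=$(getline "$taskid" "$manifest" | cut -f 1 -d ' ')
scaler=$(getline "$taskid" "$manifest" | cut -f 2 -d ' ')
model=$(getline "$taskid" "$manifest" | cut -f 3 -d ' ')

# set up the software environment
module purge
module load mamba/latest
source activate ML_training_env

# pass the three manifest fields to the training script
python ML_training.py "$loopidx" "$scaler" "$model"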
  4. Create the python script

# not a complete list of imports
import sys

import pandas as pd
import numpy as np
import sklearn
from imblearn.over_sampling import SMOTE, ADASYN

# parse the inputs passed by the sbatch script
random_seed = int(sys.argv[1])  # seed for the random state
scaler = sys.argv[2]            # pre-processing choice: none, smote or adasyn
model_choice = sys.argv[3]      # ML model choice: RF or XGB

# start all the fancy training from here
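Because a typo in the manifest would otherwise only surface deep inside a training run, the three arguments can be validated up front. The sketch below uses `argparse` from the standard library instead of reading `sys.argv` directly. The `parse_args` wrapper is a hypothetical name, and the allowed values are taken from the manifest command above.

```python
import argparse


def parse_args(argv=None):
    """Parse and validate the seed, pre-processing choice and model choice."""
    parser = argparse.ArgumentParser(
        description="Train one seed/pre-processing/model combination"
    )
    parser.add_argument("random_seed", type=int)
    parser.add_argument("scaler", choices=["none", "smote", "adasyn"])
    parser.add_argument("model_choice", choices=["RF", "XGB"])
    return parser.parse_args(argv)


# Demo with an explicit argument list; in the real script, call parse_args()
# with no arguments so that it reads the command line.
args = parse_args(["3", "smote", "RF"])
print(args.random_seed, args.scaler, args.model_choice)
```

An invalid value makes the script exit immediately with a usage message, which then shows up in that sub-job's STDERR file.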

Additional Help

If you require further assistance on this topic, please contact the Research Computing Team. To create a support ticket, review our RTO Request Help page. For quick inquiries, reach out via our #rc-support Slack Channel or attend our office hours for live assistance.

We also offer a series of Educational Opportunities and Workshops.
