Here are some real use cases of Slurm job arrays.
Case One: Bioinformatics Essential - Bulk BLAST Query
...
In bioinformatics research, “BLAST” is a tool for searching input DNA or RNA sequence files (.fasta files in this example, which are formatted text files) against genomic databases. This example uses blastx on a folder containing 100 fasta files and searches each of these files against four existing databases, so the job array launches a total of 100*4 = 400 sub-jobs. The example covers all the steps needed, but assumes that the mamba environment, the BLAST software suite, and the BLAST databases are already set up correctly.
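Before going into the scripts, here is a minimal sketch of how such a manifest could be generated, in the same style as the GNU parallel command used in Case Two below. The ./fasta/ directory and the database names db1 ... db4 are placeholders, not part of the original example:

```bash
# Pair every fasta file with every database: 100 * 4 = 400 manifest lines,
# one line per sub-job. Paths and database names below are placeholders.
parallel -k echo {} ::: fasta/*.fasta ::: db1 db2 db3 db4 > manifest
```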
...
The sbatch script runs from the submission directory, so the manifest file should be in the same directory the job is submitted from. After the run, two output files are generated for each fasta file: an fmt11 archive file and a human-readable fmt6 file. The paths used in the code need to be changed carefully to reflect the actual directory structure.
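As a rough illustration, an array script for this case could look like the sketch below. The environment name, resource requests, and directory layout are assumptions; each sub-job reads its own manifest line, runs blastx into an fmt11 archive, and converts that archive to the readable fmt6 table with blast_formatter:

```bash
#!/bin/bash
#SBATCH --job-name=blastx_array
#SBATCH --array=1-400
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=04:00:00
#SBATCH --output=logs/%x_%A_%a.out

# Environment name is a placeholder; activate whatever provides BLAST+.
source activate blast_env

mkdir -p results

# Read the manifest line belonging to this sub-job: "<fasta path> <database>".
read -r query db <<< "$(sed -n "${SLURM_ARRAY_TASK_ID}p" manifest)"
base=$(basename "$query" .fasta)

# Search and keep the full archive (outfmt 11) ...
blastx -query "$query" -db "$db" -num_threads "$SLURM_CPUS_PER_TASK" \
       -outfmt 11 -out "results/${base}_${db}.asn"

# ... then convert the archive to the readable tabular format (outfmt 6).
blast_formatter -archive "results/${base}_${db}.asn" \
                -outfmt 6 -out "results/${base}_${db}.fmt6.tsv"
```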
...
Case Two: From Loops to Parallel
Running many similar loop iterations serially can be time-consuming. Since each iteration is nearly identical and independent, a better solution is to run each iteration as a single sub-job in a job array, as the short sketch below illustrates.
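As a quick illustration of the idea (the script name and the loop range here are placeholders), a serial loop and its job-array equivalent look like this:

```bash
# Serial version: one long job runs every iteration in turn.
for i in $(seq 1 60); do
    python train.py "$i"
done

# Job-array version: 60 independent sub-jobs run in parallel.
# Inside an sbatch script submitted with "#SBATCH --array=1-60",
# each sub-job handles exactly one value of the former loop variable:
python train.py "$SLURM_ARRAY_TASK_ID"
```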
Design the workflow
For example, in machine learning, a single train/test split is often not representative enough to avoid bias, and even a single round of k-fold cross-validation is sometimes still not enough. A solution is to loop over a range of integers, use each integer as the seed of the random state, and run multiple splits with different random seeds. We can then simply send out each iteration of that loop as a sub-job, using the sub-job ID as the random seed.
A job array can do even more. If multiple pre-processing methods and different machine learning models need to be compared against each other on the same input data, we can set up those choices as different attributes in our manifest file and generate a sub-job for each combination.
This is a more complicated example: we need a manifest file, an sbatch script, and a Python script that runs the actual machine learning training process.
Generate the manifest file
```bash
parallel -k echo {} ::: $(seq 1 10) ::: none smote adasyn ::: RF XGB > manifest
```
The first attribute, $(seq 1 10), expands to the integers 1 through 10. These are our random seeds, which also means each combination of pre-processing method and ML model is repeated independently 10 times.
The second attribute gives the three choices for pre-processing the dataset: doing nothing, the SMOTE algorithm for balancing labels, and the ADASYN algorithm for balancing labels.
The third attribute gives the two choices of ML algorithm: “RF” for Random Forest and “XGB” for XGBoost.
In total we have 10*3*2 = 60 lines in the manifest file, or 60 sub-jobs in this job array.
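With -k keeping the order deterministic, the last attribute varies fastest, so the manifest should start like this:

```
1 none RF
1 none XGB
1 smote RF
1 smote XGB
1 adasyn RF
1 adasyn XGB
2 none RF
...
```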
Create the sbatch script
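A minimal sketch of what the sbatch script could look like, assuming a mamba environment named ml_env and a training script named train.py (both placeholder names, not the original code). Each sub-job reads its line of the manifest and passes the three attributes to the Python script:

```bash
#!/bin/bash
#SBATCH --job-name=ml_array
#SBATCH --array=1-60
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00
#SBATCH --output=logs/%x_%A_%a.out

# Environment and script names are placeholders.
source activate ml_env

# Manifest line for this sub-job: "<seed> <preprocess> <model>".
read -r seed prep model <<< "$(sed -n "${SLURM_ARRAY_TASK_ID}p" manifest)"

python train.py --seed "$seed" --preprocess "$prep" --model "$model"
```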
Create the Python script
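A minimal sketch of the training script (named train.py here to match the sbatch sketch above), assuming a tabular dataset data.csv with a binary "label" column and that scikit-learn, imbalanced-learn, and xgboost are installed in the environment; it trains one (seed, pre-processing, model) combination and prints one result line:

```python
#!/usr/bin/env python
"""Train one (seed, preprocess, model) combination from the manifest."""
import argparse

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, required=True)
parser.add_argument("--preprocess", choices=["none", "smote", "adasyn"], required=True)
parser.add_argument("--model", choices=["RF", "XGB"], required=True)
args = parser.parse_args()

# Placeholder dataset: a CSV file with a binary "label" column.
df = pd.read_csv("data.csv")
X, y = df.drop(columns=["label"]), df["label"]

# The sub-job's seed drives the split, so every run is an independent repeat.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=args.seed, stratify=y
)

# Optional resampling of the training set only.
if args.preprocess == "smote":
    from imblearn.over_sampling import SMOTE
    X_train, y_train = SMOTE(random_state=args.seed).fit_resample(X_train, y_train)
elif args.preprocess == "adasyn":
    from imblearn.over_sampling import ADASYN
    X_train, y_train = ADASYN(random_state=args.seed).fit_resample(X_train, y_train)

# Pick the classifier for this sub-job.
if args.model == "RF":
    clf = RandomForestClassifier(random_state=args.seed)
else:
    from xgboost import XGBClassifier
    clf = XGBClassifier(random_state=args.seed)

clf.fit(X_train, y_train)
score = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# One result line per sub-job; aggregate these lines after the array finishes.
print(f"{args.seed}\t{args.preprocess}\t{args.model}\t{score:.4f}")
```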