Here are some real use cases of Slurm job arrays.
Case One: Bioinformatics Essential - Bulk BLAST Query
...
In bioinformatics research, “BLAST” is a tool for searching input DNA or RNA sequence files (.fasta files in this example, which are formatted text files) against genomic databases. This example uses blastx on a folder containing 100 fasta files and searches each of these files against four existing databases, so the job array launches a total of 100*4 = 400 sub-jobs. The example covers all the steps needed, but assumes that the mamba environment, the BLAST software suite, and the BLAST databases are already set up correctly.
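Before going into the scripts, here is a minimal sketch of how such a manifest could be generated, in the same style as the GNU parallel command used in Case Two below. The ./fasta/ directory and the database names db1 ... db4 are placeholders, not part of the original example:

```bash
# Pair every fasta file with every database: 100 * 4 = 400 manifest lines,
# one line per sub-job. Paths and database names below are placeholders.
parallel -k echo {} ::: fasta/*.fasta ::: db1 db2 db3 db4 > manifest
```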
...
The sbatch script runs from the submission directory, so the manifest file should be in the same directory the job is submitted from. After the run, two output files are generated for each fasta file: an fmt11 archive file and a human-readable fmt6 file. The paths used in the code need to be changed carefully to reflect the actual directory structure.
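As a rough illustration, an array script for this case could look like the sketch below. The environment name, resource requests, and directory layout are assumptions; each sub-job reads its own manifest line, runs blastx into an fmt11 archive, and converts that archive to the readable fmt6 table with blast_formatter:

```bash
#!/bin/bash
#SBATCH --job-name=blastx_array
#SBATCH --array=1-400
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=04:00:00
#SBATCH --output=logs/%x_%A_%a.out

# Environment name is a placeholder; activate whatever provides BLAST+.
source activate blast_env

mkdir -p results

# Read the manifest line belonging to this sub-job: "<fasta path> <database>".
read -r query db <<< "$(sed -n "${SLURM_ARRAY_TASK_ID}p" manifest)"
base=$(basename "$query" .fasta)

# Search and keep the full archive (outfmt 11) ...
blastx -query "$query" -db "$db" -num_threads "$SLURM_CPUS_PER_TASK" \
       -outfmt 11 -out "results/${base}_${db}.asn"

# ... then convert the archive to the readable tabular format (outfmt 6).
blast_formatter -archive "results/${base}_${db}.asn" \
                -outfmt 6 -out "results/${base}_${db}.fmt6.tsv"
```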
...
Case Two: From Loops to Parallel
Running many similar loop iterations serially can be time-consuming. Since each iteration is nearly identical and independent, a better solution is to run each iteration as a single sub-job in a job array, as the short sketch below illustrates.
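As a quick illustration of the idea (the script name and the loop range here are placeholders), a serial loop and its job-array equivalent look like this:

```bash
# Serial version: one long job runs every iteration in turn.
for i in $(seq 1 60); do
    python train.py "$i"
done

# Job-array version: 60 independent sub-jobs run in parallel.
# Inside an sbatch script submitted with "#SBATCH --array=1-60",
# each sub-job handles exactly one value of the former loop variable:
python train.py "$SLURM_ARRAY_TASK_ID"
```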
Design the workflow
For example, in machine learning, a single train/test split is often not representative enough to avoid bias, and even a single round of k-fold cross-validation is sometimes still not enough. A solution is to loop over a range of integers, use each integer as the seed of the random state, and run multiple splits with different random seeds. We can then simply send out each iteration of that loop as a sub-job, using the sub-job ID as the random seed.
A job array can do even more. If multiple pre-processing methods and different machine learning models need to be compared against each other on the same input data, we can set up those choices as different attributes in our manifest file and generate a sub-job for each combination.
This is a more complicated example: we need a manifest file, an sbatch script, and a Python script that runs the actual machine learning training process.
Generate the manifest file
```bash
parallel -k echo {} ::: $(seq 1 10) ::: none smote adasyn ::: RF XGB > manifest
```
The first attribute, $(seq 1 10), expands to the integers 1 through 10. These are our random seeds, which also means each combination of pre-processing method and ML model is repeated independently 10 times.
The second attribute gives the three choices for pre-processing the dataset: doing nothing, the SMOTE algorithm for balancing labels, and the ADASYN algorithm for balancing labels.
The third attribute gives the two choices of ML algorithm: “RF” for Random Forest and “XGB” for XGBoost.
In total we have 10*3*2 = 60 lines in the manifest file, or 60 sub-jobs in this job array.
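With -k keeping the order deterministic, the last attribute varies fastest, so the manifest should start like this:

```
1 none RF
1 none XGB
1 smote RF
1 smote XGB
1 adasyn RF
1 adasyn XGB
2 none RF
...
```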
Create the sbatch script
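A minimal sketch of what the sbatch script could look like, assuming a mamba environment named ml_env and a training script named train.py (both placeholder names, not the original code). Each sub-job reads its line of the manifest and passes the three attributes to the Python script:

```bash
#!/bin/bash
#SBATCH --job-name=ml_array
#SBATCH --array=1-60
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=02:00:00
#SBATCH --output=logs/%x_%A_%a.out

# Environment and script names are placeholders.
source activate ml_env

# Manifest line for this sub-job: "<seed> <preprocess> <model>".
read -r seed prep model <<< "$(sed -n "${SLURM_ARRAY_TASK_ID}p" manifest)"

python train.py --seed "$seed" --preprocess "$prep" --model "$model"
```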
Create the Python script
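A minimal sketch of the training script (named train.py here to match the sbatch sketch above), assuming a tabular dataset data.csv with a binary "label" column and that scikit-learn, imbalanced-learn, and xgboost are installed in the environment; it trains one (seed, pre-processing, model) combination and prints one result line:

```python
#!/usr/bin/env python
"""Train one (seed, preprocess, model) combination from the manifest."""
import argparse

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

parser = argparse.ArgumentParser()
parser.add_argument("--seed", type=int, required=True)
parser.add_argument("--preprocess", choices=["none", "smote", "adasyn"], required=True)
parser.add_argument("--model", choices=["RF", "XGB"], required=True)
args = parser.parse_args()

# Placeholder dataset: a CSV file with a binary "label" column.
df = pd.read_csv("data.csv")
X, y = df.drop(columns=["label"]), df["label"]

# The sub-job's seed drives the split, so every run is an independent repeat.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=args.seed, stratify=y
)

# Optional resampling of the training set only.
if args.preprocess == "smote":
    from imblearn.over_sampling import SMOTE
    X_train, y_train = SMOTE(random_state=args.seed).fit_resample(X_train, y_train)
elif args.preprocess == "adasyn":
    from imblearn.over_sampling import ADASYN
    X_train, y_train = ADASYN(random_state=args.seed).fit_resample(X_train, y_train)

# Pick the classifier for this sub-job.
if args.model == "RF":
    clf = RandomForestClassifier(random_state=args.seed)
else:
    from xgboost import XGBClassifier
    clf = XGBClassifier(random_state=args.seed)

clf.fit(X_train, y_train)
score = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# One result line per sub-job; aggregate these lines after the array finishes.
print(f"{args.seed}\t{args.preprocess}\t{args.model}\t{score:.4f}")
```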