...

Info

Check out the /data/datasets/community/ directory for ready-to-use BLAST databases.

  1. Design the workflow

We will need a manifest file to feed the inputs; it should list the locations of the FASTA files and the names/locations of the databases. We then need an sbatch script that calls the BLASTx program and generates the job array, and a single command line to submit this sbatch script.

...

  1. Generate the manifest file

There are several ways to generate a manifest file with one or more columns; using parallel is one of the easiest. Parallel is available as a software module on the supercomputers and can also be installed with mamba in a Python environment. This step can also be performed on your local computer if parallel is installed there. To find the parallel module on the supercomputers:

Code Block
module avail parallel  # to find the correct module name
module load parallel-20220522-gcc-12.1.0  # for Sol
module load parallel-20220522-ie          # for Phoenix
Code Block
parallel -k echo {} ::: db1 db2 db3 db4 ::: /dir/to/all/the/*.fasta > manifest

This command pairs each of the attributes {db1, db2, db3, db4} with every FASTA file in the given directory, then writes one combination per line into a text file called manifest. Below is an example of the output. The -k flag keeps the output lines in the same order as the input arguments, rather than in the order in which the individual jobs happen to finish.

Code Block
# parallel -k echo {} ::: db1 db2 db3 db4 ::: /scratch/spock/dataset/*.fasta > manifest

db1 /scratch/spock/dataset/sample1.fasta
db1 /scratch/spock/dataset/sample2.fasta
db2 /scratch/spock/dataset/sample1.fasta
db2 /scratch/spock/dataset/sample2.fasta
db3 /scratch/spock/dataset/sample1.fasta
db3 /scratch/spock/dataset/sample2.fasta
db4 /scratch/spock/dataset/sample1.fasta
db4 /scratch/spock/dataset/sample2.fasta
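
If parallel is not available, the same database/FASTA combinations can be produced with a nested loop in plain bash. The following is a minimal sketch rather than the method used above: db1 to db4 are the placeholder names from the example, and the temporary directory with its two empty FASTA files is a hypothetical stand-in for a real dataset directory.

```shell
#!/bin/bash
# Sketch: build every database/FASTA combination with a nested loop.
# demo_dir and its sample files are hypothetical, created only for this demo.
set -euo pipefail
demo_dir=$(mktemp -d)
touch "$demo_dir/sample1.fasta" "$demo_dir/sample2.fasta"

manifest="$demo_dir/manifest"
: > "$manifest"                            # start with an empty manifest
for db in db1 db2 db3 db4; do              # first attribute: database names
    for fasta in "$demo_dir"/*.fasta; do   # second attribute: every FASTA file
        echo "$db $fasta" >> "$manifest"
    done
done

cat "$manifest"    # 4 databases x 2 files = 8 lines
```

The outer loop runs over the first attribute and the inner loop over the second, so every database is paired with every file exactly once.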

...

Multiple loops can be time-consuming to run serially. Since each loop iteration is similar, a better solution is to run each iteration as a single sub-job in a job array.

  1. Design the workflow

For example, in machine learning, a single split into training and testing sets is often not representative enough to avoid bias. Even a single round of k-fold cross-validation is sometimes not enough. One solution is to loop over a range of integers, use each integer as the seed of the random state variable, and run multiple splits with different random seeds. We can then simply send out each loop iteration as a sub-job, using the sub-job ID as the random seed.

A job array can do even more. If multiple pre-processing methods and different machine learning models need to be compared against each other on the same input data, we can set up those choices as different attributes in our manifest file and generate a sub-job for each combination.

This is a more complicated example: we need a manifest file, an sbatch script, and a Python script that runs the actual machine learning training process. Please refer to case 1 for more explanations.

...

  1. Generate the manifest file

Code Block
parallel -k echo {} ::: $(seq 1 10) ::: none smote adasyn ::: RF XGB > manifest

The first attribute, $(seq 1 10), expands to the integers from 1 to 10. These are our random seeds, which means each combination of pre-processing method and ML model is repeated independently 10 times.

The second attribute lists the three choices for pre-processing the dataset: doing nothing (none), the SMOTE algorithm for balancing labels, and the ADASYN algorithm for balancing labels.

The third attribute lists the two choices of ML algorithm: “RF” for Random Forest and “XGB” for XGBoost.

In total, we have 10*3*2=60 lines in the manifest file, or 60 sub-jobs in this job array.
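
The line count can be cross-checked without parallel, because bash brace expansion produces the same combinations. This is a minimal sketch for verification only. The temporary manifest path and the script name ml_array.sh are assumptions made for the demo, and the sbatch line is only printed, not executed.

```shell
#!/bin/bash
# Sketch: reproduce the 10 x 3 x 2 combinations with brace expansion and count them.
set -euo pipefail
manifest=$(mktemp)     # hypothetical scratch location for the demo

# Each expanded word looks like "1 none RF"; printf writes one word per line.
printf '%s\n' {1..10}" "{none,smote,adasyn}" "{RF,XGB} > "$manifest"

n_jobs=$(wc -l < "$manifest")
n_jobs=$((n_jobs))     # strip any padding some wc implementations add
echo "sub-jobs: $n_jobs"

# Dry run: one array index per manifest line (ml_array.sh is a hypothetical name).
echo "sbatch --array=1-${n_jobs} ml_array.sh $manifest"
```

Deriving the array range from the manifest's line count keeps the two in step if attributes are added or removed later.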

  1. Create the sbatch script

Code Block
#!/bin/bash

#SBATCH -N 1            # number of nodes
#SBATCH -c 1            # number of cores
#SBATCH -t 0-04:00:00   # time in d-hh:mm:ss
#SBATCH -p general      # partition
#SBATCH -q public       # QOS
#SBATCH -e slurm.%A_%a.err # file to save job's STDERR
#SBATCH --export=NONE   # Purge the job-submitting shell environment
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=spock@asu.edu

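# The manifest file is passed as the first argument to this script
manifest="${1:?ERROR -- must pass a manifest file}"
taskid=$SLURM_ARRAY_TASK_ID

# Each manifest line is "<seed> <pre-processing> <model>";
# look up the line that belongs to this sub-job and split it into fields
loopidx=$(getline "$taskid" "$manifest" | cut -f 1 -d ' ')
scaler=$(getline "$taskid" "$manifest" | cut -f 2 -d ' ')
model=$(getline "$taskid" "$manifest" | cut -f 3 -d ' ')

module purge
module load mamba/latest
source activate ML_training_env

python ML_training.py "$loopidx" "$scaler" "$model"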
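
The field extraction in the script above can be tried outside Slurm. The sketch below assumes that getline prints the requested line of a file and uses sed as a stand-in for it. That substitution, the three-line demo manifest, and the hard-coded task ID are assumptions for illustration, not part of the original script.

```shell
#!/bin/bash
# Sketch: pick one manifest line by task ID and split it into its three fields.
set -euo pipefail
manifest=$(mktemp)     # hypothetical demo manifest
printf '%s\n' "1 none RF" "1 none XGB" "1 smote RF" > "$manifest"

taskid=3               # inside an array job this would be $SLURM_ARRAY_TASK_ID

# sed -n "Np" prints only line N (assumed stand-in for the getline helper).
line=$(sed -n "${taskid}p" "$manifest")

# read splits the line on whitespace into the three attributes.
read -r loopidx scaler model <<< "$line"
echo "seed=$loopidx scaler=$scaler model=$model"
```

Reading the line once and splitting it with read avoids calling the lookup three times, at the cost of depending on whitespace-separated fields.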

...