...

In bioinformatics research, “BLAST” is a tool for searching input DNA or RNA sequence files (.fasta files in this example, which are formatted text files) against genomic databases. This example is about using blastx on a folder containing 100 fasta files, and searching these files against four existing databases. So the job array will invoke a total of 100*4=400 sub-jobs. The example will cover all the steps needed, assuming the mamba env, the BLAST software suite, and the BLAST databases are set up correctly.

Info

Check out the /data/datasets/community/ directory for ready-to-use BLAST databases.

...

This step is not a part of setting up the job array, but it is very important for estimating a good wall time and number of cores. Although many sbatch parameters are given in the sbatch script above, they can be overridden directly on the command line. In the commands below, -a specifies the sub-job number in the job array, and -c overrides the number of cores requested in the script.

So all three commands below use the first sub-job to do the benchmarking, which corresponds to the first line in our manifest file: querying /scratch/spock/dataset/sample1.fasta against db1.
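A sketch of such benchmarking commands (the script and manifest names follow this example; the -c values are illustrative):

```shell
# Benchmark sub-job 1 (the first manifest row) with increasing core counts
sbatch -a 1 -c 2 blastx_fasta_array.sh manifest
sbatch -a 1 -c 4 blastx_fasta_array.sh manifest
sbatch -a 1 -c 8 blastx_fasta_array.sh manifest
```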

...

More cores can be tested. After all the tests are completed, we can decide on the proper number of cores for each sub-job (the -c value), and also the total wall time needed. The wall time requested must be longer than the wall time needed for a single sub-job.

Info

According to our observations, BLAST does not scale linearly; more cores will not necessarily lead to shorter run times.

Since we are using sbatch scripts to submit our jobs, we can also use some Slurm commands to investigate the efficiency of these benchmarking runs:

Code Block
$ myacct
## This command will print out all the recent jobs run on your account

$ seff <jobID>
## e.g. seff 12346788_1
## This command will print out the core utilization rates and other job statistics.
## You can make a short script to loop through all the sub-jobs, 
## or just check a few of them, 
## or include this line at the end of the computation code, 
## like after line 40 in step 3.
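The per-sub-job check can be scripted; a minimal sketch, assuming the example job ID above and the 8 sub-jobs of this example:

```shell
# Print seff statistics for every sub-job in the array
for i in $(seq 1 8); do
  seff 12346788_${i}
done
```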

  1. Run the entire job array

First, find out how many rows there are in the manifest file; that is the total number of sub-jobs. For this example we have a total of 8 sub-jobs, so the command to submit the job array is:
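As a sketch (assuming the script and manifest names used in this example, and 1-based sub-job indices):

```shell
wc -l manifest                                  # count rows = number of sub-jobs
sbatch -a 1-8 blastx_fasta_array.sh manifest    # submit all 8 sub-jobs at once
```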

...

The sbatch script runs from the submission directory, so the manifest file should be in that same directory. After the run, two output files will be generated for each fasta file: one is the fmt11 archive file, and the other is the readable fmt6 file. The paths used in the code need to be carefully changed to reflect the actual directory structure.
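For reference, a minimal sketch of how each sub-job might pick its own manifest row and produce the two output files; the manifest columns, the output file naming, and the output paths here are assumptions for illustration, not the exact script above:

```shell
# SLURM_ARRAY_TASK_ID is set by Slurm for each sub-job;
# assume a two-column manifest: fasta path, database name
read -r FASTA DB <<< "$(sed -n "${SLURM_ARRAY_TASK_ID}p" manifest)"

# fmt11 ASN.1 archive first, then a readable fmt6 table derived from it
blastx -query "$FASTA" -db "$DB" -outfmt 11 -out "${FASTA%.fasta}_${DB}.fmt11"
blast_formatter -archive "${FASTA%.fasta}_${DB}.fmt11" -outfmt 6 -out "${FASTA%.fasta}_${DB}.fmt6"
```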

  1. Run the failed sub-jobs individually

Sometimes, we may have failed sub-jobs among otherwise successful sub-jobs in a job array. We will need to find the IDs of these failed sub-jobs. To run them again, replace X with the sub-job ID in the command below:

Code Block
sbatch -a X blastx_fasta_array.sh manifest
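To locate the failed sub-jobs, sacct can filter the array's accounting records; a sketch, assuming the example job ID used earlier:

```shell
# Show only the array entries whose state is FAILED
sacct -j 12346788 | grep FAILED
```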

...

Case Two: From Loops to Parallel

...

The first attribute, $(seq 1 10), prints out the integers from 1 to 10. Those are our random seeds, which also means each combination of pre-processing method and ML model will be repeated independently 10 times.

The second attribute is the three choices for pre-processing the dataset, which are doing nothing, the SMOTE algorithm for balancing labels, and the ADASYN algorithm for balancing labels.

The third attribute is the two choices for ML algorithms, which are “RF” for Random Forest and “XGB” for XGBoost.

So in total we have 10*3*2=60 lines in the manifest file, or 60 sub-jobs in this job array.
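The three nested choices above can be written out with nested loops; a minimal sketch ("none" is an assumed label for the do-nothing pre-processing option, so adjust it to whatever your processing script expects):

```shell
# Write one manifest line per (seed, pre-processing, model) combination
for seed in $(seq 1 10); do
  for prep in none SMOTE ADASYN; do
    for model in RF XGB; do
      echo "$seed $prep $model"
    done
  done
done > manifest

wc -l manifest    # 60 manifest
```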

  1. Create the sbatch script

...