...
In bioinformatics research, “BLAST” is a tool to search input DNA or RNA sequence files (.fasta files in this example, which are formatted text files) against genomic databases. This example uses Blastx on a folder containing 100 fasta files and searches these files against four existing databases, so the job array will invoke a total of 100*4=400 sub-jobs. The example covers all the steps needed, assuming the mamba env, the Blast software suite, and the Blast databases are set up correctly.
Info |
---|
Check out the |
Design the workflow
We will need a manifest file to feed the inputs, which should list the location of the fasta files and the name/location of the databases. Then we need an sbatch script to call the BLASTx program and generate the job array, and a single command line to submit this sbatch script.
...
Generate the manifest file
There are other ways to generate a manifest file with one or more columns, but using parallel is one of the easiest. Parallel is available as a software module and can also be installed with mamba in a python env on the supercomputers; this step can also be performed on your local computer with parallel installed. To find the parallel module on the supercomputers:
Code Block |
---|
module avail parallel # to find the correct module name
module load parallel-20220522-gcc-12.1.0 # for Sol
module load parallel-20220522-ie # for Phoenix |
Code Block |
---|
parallel -k echo {} ::: db1 db2 db3 db4 ::: /dir/to/all/the/*.fasta > manifest |
This line of code combines each of the attributes {db1, db2, db3, db4} with each fasta file in the given directory, then writes the results line by line into a text file called manifest. Below is an example of the output. The -k flag keeps the output lines in the same order as the given attributes.
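For readers without parallel at hand, the same cross product can be sketched with two nested loops in plain shell. The scratch directory, the two sample file names, and the manifest_loops output name below are made up for illustration. The outer loop runs over the databases so that the line order matches what parallel -k produces, where the last attribute varies fastest.

```shell
# Sketch: build a manifest as the cross product of databases and fasta files.
# The directory and file names are placeholders for illustration only.
workdir=$(mktemp -d)
touch "$workdir/sample1.fasta" "$workdir/sample2.fasta"

manifest="$workdir/manifest_loops"
for db in db1 db2 db3 db4; do            # first attribute varies slowest
  for fasta in "$workdir"/*.fasta; do    # last attribute varies fastest
    echo "$db $fasta"
  done
done > "$manifest"

# Strip padding that some wc implementations add around the number.
n_lines=$(wc -l < "$manifest" | tr -d ' ')
first_line=$(head -n 1 "$manifest")
echo "lines: $n_lines"
echo "first: $first_line"
```

With four databases and two fasta files this writes eight lines, one per sub-job.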
Expand | ||
---|---|---|
|
...
Expand | ||
---|---|---|
|
Info |
---|
The “getline” is a customized command, not a standard Linux command. It can be replaced with “sed” commands. |
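As one possible replacement for getline, sed -n with the p command prints only the requested line of a file. The sketch below assumes a two-column manifest (database first, then fasta path) and falls back to task 1 when SLURM_ARRAY_TASK_ID is not set, so it can be tried outside of Slurm. The demo manifest contents, including sample2.fasta, are invented.

```shell
# Sketch: pick the manifest line that belongs to the current sub-job.
manifest=$(mktemp)
printf '%s\n' \
  "db1 /scratch/spock/dataset/sample1.fasta" \
  "db1 /scratch/spock/dataset/sample2.fasta" \
  "db2 /scratch/spock/dataset/sample1.fasta" > "$manifest"

# Slurm sets SLURM_ARRAY_TASK_ID inside an array job; default to 1 elsewhere.
task_id=${SLURM_ARRAY_TASK_ID:-1}

# Print only line number $task_id of the manifest.
line=$(sed -n "${task_id}p" "$manifest")

# Split the line at the first space: database name, then fasta path.
db=${line%% *}
fasta=${line#* }

echo "sub-job $task_id: query $fasta against $db"
```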
In this step, the fmt11 output is the archive format, which is not human-readable but contains all of the query results. Then the desired columns are parsed from the archive file to a human-readable table format via the fmt6 output. The fmt11 file can be stored for future reference. This is one of the best practices for using BLAST, and more formatting info can be found here: https://www.ncbi.nlm.nih.gov/books/NBK279684/table/appendices.T.options_common_to_all_blast/
Benchmarking
This step is not a part of setting up the job array, but it is very important for estimating a good wall time and number of cores. Although many sbatch parameters are given in the sbatch script above, they can be overridden directly on the command line. In the commands below, -a gives the sub-job number in the job array, and -c overrides the number of cores requested in the script.
So all three commands below use the first sub-job for benchmarking, which corresponds to the first line in our manifest file: querying /scratch/spock/dataset/sample1.fasta against db1.
...
More cores can be tested. After all the tests are completed, we can decide on the proper number of cores for each sub-job (the -c value), and also the total wall time needed. The total wall time must be longer than the wall time needed for a single sub-job.
Info |
---|
According to our observations, BLAST does not scale linearly; more cores will not necessarily lead to a shorter run time. |
Since we are using sbatch scripts to submit our jobs, we can also use some Slurm commands to investigate the efficiency of these benchmarking runs:
Code Block |
---|
$ myacct
## This command will print out all the recent jobs run on your account.
$ seff <jobID>
## e.g. seff 12346788_1
## This command will print out the core utilization and other job statistics. |
For the seff command, you can make a short script to loop through all the sub-jobs, or just check a few of them. Note that the seff command doesn’t produce reliable results for a running job; it should only be used on a completed job.
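A loop over the sub-job IDs could look like the sketch below. The job ID is the example one used above and the count of 8 sub-jobs is taken from this example; both need to be replaced with real values. Because seff only exists on the cluster, the sketch first collects the commands in a file and only runs them when seff is actually available.

```shell
# Sketch: run seff on every sub-job of one job array.
job_id=12346788     # example array job ID, replace with your own
n_subjobs=8         # number of rows in the manifest

# Write one seff command per sub-job, using the <jobID>_<subjobID> form.
cmd_file=$(mktemp)
for i in $(seq 1 "$n_subjobs"); do
  echo "seff ${job_id}_${i}"
done > "$cmd_file"

if command -v seff >/dev/null 2>&1; then
  sh "$cmd_file"          # on the cluster: print statistics for each sub-job
else
  cat "$cmd_file"         # elsewhere: just show what would be run
fi
```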
Run the entire job array
First, find out how many rows there are in the manifest file; this is the total number of sub-jobs. For this example, we have a total of 8 sub-jobs, so the command to submit the job array is:
...
The sbatch script runs from the submission directory, so the manifest file should be in that same directory. After the run, two output files will have been generated for each fasta file: one is the fmt11 archive file, and the other is the readable fmt6 file. The paths used in the code need to be carefully changed to reflect the actual directory structure.
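The row count does not have to be read off by hand. As a small sketch, wc -l gives the number of manifest rows, which can then serve as the upper bound of the array range. The demo manifest is generated on the spot with invented file names, the 1-N range form is an assumption about how the array is submitted, and the final echo only prints the command, so nothing is submitted.

```shell
# Sketch: derive the job array range from the number of manifest rows.
manifest=$(mktemp)
for db in db1 db2 db3 db4; do
  for f in sample1.fasta sample2.fasta; do
    echo "$db /scratch/spock/dataset/$f"
  done
done > "$manifest"

# Count the rows; tr strips padding that some wc implementations add.
n_subjobs=$(wc -l < "$manifest" | tr -d ' ')

# Build (but do not run) the submission command for sub-jobs 1..N.
submit_cmd="sbatch -a 1-${n_subjobs} blastx_fasta_array.sh manifest"
echo "$submit_cmd"
```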
Run the failed sub-jobs individually
Sometimes, we may have failed sub-jobs among other successful sub-jobs in a job array. We will need to find the job ID of these failed jobs. To run them again, replace X with their sub-job ID in the command below:
Code Block |
---|
sbatch -a X blastx_fasta_array.sh manifest |
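When several sub-jobs have failed, they do not need to be resubmitted one at a time: the -a option of sbatch also accepts a comma-separated list of indices. The sketch below assumes the failed indices 3, 5 and 7 (made up for illustration) and only prints the resulting command.

```shell
# Sketch: resubmit several failed sub-jobs with one sbatch call.
failed_ids="3 5 7"                              # hypothetical failed sub-job indices

# Turn the space-separated list into the comma-separated form sbatch expects.
id_list=$(echo "$failed_ids" | tr ' ' ',')

# Build (but do not run) the resubmission command.
resubmit_cmd="sbatch -a ${id_list} blastx_fasta_array.sh manifest"
echo "$resubmit_cmd"
```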
...
Case Two: From Loops to Parallel
Loops with many iterations can be time-consuming to run serially. Since the iterations are similar to each other, a better solution is to run each iteration as a single sub-job in a job array.
Design the workflow
For example, in machine learning, a single split into a training set and a testing set is usually not representative enough to avoid bias. Even a single round of k-fold cross-validation is sometimes not enough. A solution is to loop over a range of integers, use each integer as the seed of the random state variable, and run multiple splits with different random seeds. We can then simply send out each iteration as a sub-job, using the sub-job ID as the random seed.
A job array can do even more. If multiple pre-processing methods and different machine learning models need to be compared against each other on the same input data, we can set up those choices as different attributes in our manifest file and generate a sub-job for each combination.
This is a more complicated example, and we need a manifest file, an sbatch script, and a python script to run the actual machine learning training process. Please refer to Case One for more explanations.
...
Generate the manifest file
Code Block |
---|
parallel -k echo {} ::: $(seq 1 10) ::: none smote adasyn ::: RF XGB > manifest |
The first attribute $(seq 1 10) prints out the integers from 1 to 10. Those are our random seeds, which also means each combination of pre-processing method and ML model will be repeated independently 10 times.
The second attribute lists the three choices for pre-processing the dataset: doing nothing, the SMOTE algorithm for balancing labels, and the ADASYN algorithm for balancing labels.
The third attribute lists the two choices for ML algorithms, which are “RF” for Random Forest and “XGB” for XGBoost.
So in total we have 10*3*2=60 lines in the manifest file, or 60 sub-jobs in this job array.
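The count can be double-checked before submitting. The sketch below rebuilds the same combinations with plain nested loops (so it runs without parallel), then counts the rows and the distinct values in each column. The temporary file is arbitrary and stands in for the real manifest.

```shell
# Sketch: sanity-check the size of the Case Two manifest.
manifest=$(mktemp)
for seed in $(seq 1 10); do
  for prep in none smote adasyn; do
    for model in RF XGB; do
      echo "$seed $prep $model"
    done
  done
done > "$manifest"

# Total rows, then the number of distinct values per column.
n_rows=$(wc -l < "$manifest" | tr -d ' ')
n_seeds=$(cut -d' ' -f1 "$manifest" | sort -u | wc -l | tr -d ' ')
n_preps=$(cut -d' ' -f2 "$manifest" | sort -u | wc -l | tr -d ' ')
n_models=$(cut -d' ' -f3 "$manifest" | sort -u | wc -l | tr -d ' ')

echo "$n_rows rows = $n_seeds seeds x $n_preps pre-processing x $n_models models"
```

If the row count does not equal the product of the three column counts, a combination is missing or duplicated in the manifest.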
Create the sbatch script
Expand | ||
---|---|---|
|
...