...

There are other ways to generate a manifest file with one or more columns; using GNU parallel is one of the easiest. The command below pairs each of the attributes {db1, db2, db3, db4} with every FASTA file in the given directory and writes the combinations line by line into a text file called manifest. An example of the output is shown below. The -k flag keeps the output lines in the same order as the input arguments.

Code Block
parallel -k echo {} ::: db1 db2 db3 db4 ::: /scratch/spock/dataset/*.fasta > manifest

db1 /scratch/spock/dataset/sample1.fasta
db1 /scratch/spock/dataset/sample2.fasta
db2 /scratch/spock/dataset/sample1.fasta
db2 /scratch/spock/dataset/sample2.fasta
db3 /scratch/spock/dataset/sample1.fasta
db3 /scratch/spock/dataset/sample2.fasta
db4 /scratch/spock/dataset/sample1.fasta
db4 /scratch/spock/dataset/sample2.fasta
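
If GNU parallel is not available, a plain bash nested loop produces an equivalent manifest. The sketch below is an alternative, not part of the original workflow; it writes the same two-column lines (database attribute, then FASTA path).

Code Block
# alternative manifest generation with plain bash (sketch, not from the original guide)
for db in db1 db2 db3 db4; do
    for fasta in /scratch/spock/dataset/*.fasta; do
        echo "$db $fasta"
    done
done > manifest
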
  2. Create the sbatch script

Code Block

...

#!/bin/bash
#SBATCH -c 8            # number of "cores"
#SBATCH -t 4:00:00      # wall time in hh:mm:ss (d-hh:mm:ss also accepted)
#SBATCH -p serial       # partition
#SBATCH -q normal       # QOS
#SBATCH -e slurm.%j.err # file to save job's STDERR (%j = JobId)
#SBATCH --export=NONE   # Purge the job-submitting shell environment
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=spock@asu.edu

# process the manifest file
manifest="${1:?ERROR -- must pass a manifest file}"
taskid=$SLURM_ARRAY_TASK_ID
case=$(sed -n "${taskid}p" "$manifest" | cut -f 1 -d ' ')
fasta=$(sed -n "${taskid}p" "$manifest" | cut -f 2 -d ' ')

# set up sbatch parameters and env
module purge
module load anaconda/py3
source activate bat

# set up input and output file names
base=$(basename -s .fasta $fasta)
out1=/scratch/name/"$base"_${case}_fmt11.txt
out2=/scratch/name/"$base"_${case}_fmt6.txt

# put all the blastx flags here just for the sake of formatting
args=(
 -query $fasta
 -db /scratch/name/path/db/$case/$case
 -evalue 1e-3
 -num_threads $(nproc)
 -max_target_seqs 5
 -max_hsps 1
 -outfmt '11'
 -out $out1
)
blastx "${args[@]}"

blast_formatter -archive $out1 -outfmt "6 qseqid sseqid evalue" -out $out2
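
Before submitting anything, it can help to confirm that the manifest parsing does what you expect. The snippet below is an interactive sanity check, not part of the submitted script; it emulates what the script does for array task 1.

Code Block
# emulate what the script does for SLURM_ARRAY_TASK_ID=1 (interactive sanity check)
taskid=1
line=$(sed -n "${taskid}p" manifest)
echo "db:    $(echo "$line" | cut -f 1 -d ' ')"
echo "fasta: $(echo "$line" | cut -f 2 -d ' ')"
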
  3. Benchmarking

This step is not part of setting up the job array, but it is very important for estimating a good wall time. Although many sbatch parameters are set in the sbatch script above, they can be overridden directly on the command line. In the commands below, -a selects which sub-job (array task) to run, and -c overrides the number of cores requested in the script. All three commands therefore benchmark the first sub-job, i.e. the first line of our manifest file, which queries /scratch/spock/dataset/sample1.fasta against db1.

Code Block
sbatch -a 1 -c 1 blastx_fasta_array.sh manifest
sbatch -a 1 -c 4 blastx_fasta_array.sh manifest
sbatch -a 1 -c 8 blastx_fasta_array.sh manifest

More core counts can be tested. After all the tests have completed, we can decide on a proper number of cores for each sub-job (the -c value) and on the wall time to request. The requested wall time applies to each sub-job individually, so it must be longer than the time the slowest sub-job needs.
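
One way to compare the benchmark runs, assuming the sacct accounting command is available on the cluster, is to look at their elapsed time and CPU usage; the job IDs below are placeholders for the IDs printed by the three sbatch commands.

Code Block
# compare elapsed time and CPU usage of the benchmark runs (job IDs are placeholders)
sacct -j 1001,1002,1003 --format=JobID,AllocCPUS,Elapsed,TotalCPU,MaxRSS
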

  4. Run the entire job array

First find out how many rows the manifest file has; that is the total number of sub-jobs. In this example there are 8 sub-jobs (4 databases × 2 FASTA files), so the command to submit the job array is:

Code Block
sbatch -a 1-8 blastx_fasta_array.sh manifest
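
To avoid counting the rows by hand, the array range can also be derived directly from the manifest; this one-liner is a small convenience, not part of the original instructions.

Code Block
# submit one sub-job per manifest line; wc -l counts the rows
sbatch -a 1-$(wc -l < manifest) blastx_fasta_array.sh manifest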

The sbatch script runs from the submission directory, so the manifest file should be placed in the directory the job is submitted from. After the run, two output files are generated for each manifest line (each database/FASTA combination): an outfmt 11 archive file and a human-readable outfmt 6 tabular file. The paths used in the code need to be carefully changed to reflect your actual directory structure.
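
A quick way to confirm that every sub-job finished and wrote its outputs is to walk the manifest again. The check below is a sketch that assumes the output directory /scratch/name used in the script above.

Code Block
# sketch: report any missing or empty output files, one pair expected per manifest line
while read -r case fasta; do
    base=$(basename -s .fasta "$fasta")
    for out in /scratch/name/"${base}_${case}"_fmt11.txt /scratch/name/"${base}_${case}"_fmt6.txt; do
        [ -s "$out" ] || echo "missing or empty: $out"
    done
done < manifest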