Using Alphafold with Singularity

This is a work in progress, based on the Compute Canada documentation at https://docs.computecanada.ca/wiki/AlphaFold.

 

Access the dataset (alphafold database)

/data/alphafold is now universally available; no access request is required. Any user already logged in to Sol can reach it simply by changing directory with cd /data/alphafold.
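For example, to confirm you can see the database directories (the prompt style below matches the rest of this page):

[asurite@agave ~]$ cd /data/alphafold
[asurite@agave ~]$ ls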

Create an account with Sylabs (free)

The build process for alphafold normally requires root access. The simplest method to circumvent this requirement is to use the Sylabs Builder tool.

As an alternative, the singularity image can be created with the normal singularity build process on a user-owned system where the user has root/admin access.

SSH to Agave

Using PuTTY or your SSH client of choice, SSH to Agave.

Once logged in you will be ready to start. If you are unable to log in here, you may need to request a Research Computing HPC account.

Build Singularity image

From your home directory, build the singularity image.

[asurite@agave ~]$ module load singularity/3.8.0
[asurite@agave ~]$ mkdir alphafold2 && cd alphafold2
[asurite@agave ~]$ singularity build --remote alphafold.sif docker://uvarc/alphafold:2.2.0
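Note: the --remote build above must first be authenticated against the Sylabs cloud builder. A minimal sketch, assuming you have generated an access token from your free Sylabs account at https://cloud.sylabs.io:

[asurite@agave ~]$ singularity remote login

Paste the access token when prompted; after that, remote builds will run under your Sylabs account.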

Before trying to run it, check our singularity documentation.

Running AlphaFold within Singularity

Here is an example of running the containerized version of alphafold2 on a given protein sequence. The protein sequence is saved in FASTA format as shown below:

Fasta file placement & example

Note: you will need to create this file

[asurite@agave ~]$ cat alphafold2/input.fasta
>5ZE6_1
MNLEKINELTAQDMAGVNAAILEQLNSDVQLINQLGYYIVSGGGKRIRPMIAVLAARAVGYEGNAHVTIAALIEFIHTATLLHDDVVDESDMRRGKATANAA
FGNAASVLVGDFIYTRAFQMMTSLGSLKVLEVMSEAVNVIAEGEVLQLMNVNDPDITEENYMRVIYSKTARLFEAAAQCSGILAGCTPEEEKGLQDYGRYLG
TAFQLIDDLLDYNADGEQLGKNVGDDLNEGKPTLPLLHAMHHGTPEQAQMIRTAIEQGNGRHLLEPVLEAMNACGSLEWTRQRAEEEADKAIAALQVLPDTP
WREALIGLAHIAVQRDR

 

Finding the reference database folders

The reference databases and model parameters needed to predict the structure of the above protein sequence have already been downloaded to /data/alphafold.

The directory name of the database will change as new databases are pulled, which happens frequently. New databases go into new directories rather than replacing existing ones, because changing or deleting a database while jobs are running may cause those jobs to fail.

The directories are named with the date stamp of when the download was started, in the form db_YYYYmmdd.

Directories that do not have “dataset_access” as the group owner have not completed downloading and are not ready for use.
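To check which database directories are ready, list them and look at the group owner (the fourth column of a long listing):

[asurite@agave ~]$ ls -ld /data/alphafold/db_*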

 

[asurite@agave ~]$ tree /data/alphafold/db_20210825
data/
├── bfd
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
│   ├── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata
│   └── bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex
├── mgnify
│   └── mgy_clusters.fa
├── params
│   ├── LICENSE
│   ├── params_model_1.npz
│   ├── params_model_1_ptm.npz
│   ├── params_model_2.npz
│   ├── params_model_2_ptm.npz
│   ├── params_model_3.npz
│   ├── params_model_3_ptm.npz
│   ├── params_model_4.npz
│   ├── params_model_4_ptm.npz
│   ├── params_model_5.npz
│   └── params_model_5_ptm.npz
├── pdb70
│   ├── md5sum
│   ├── pdb70_a3m.ffdata
│   ├── pdb70_a3m.ffindex
│   ├── pdb70_clu.tsv
│   ├── pdb70_cs219.ffdata
│   ├── pdb70_cs219.ffindex
│   ├── pdb70_hhm.ffdata
│   ├── pdb70_hhm.ffindex
│   └── pdb_filter.dat
├── pdb_mmcif
│   ├── mmcif_files
│   │   ├── 100d.cif
│   │   ├── 101d.cif
│   │   ├── 101m.cif
│   │   ├── ...
│   │   ├── ...
│   │   ├── 9wga.cif
│   │   ├── 9xia.cif
│   │   └── 9xim.cif
│   └── obsolete.dat
├── uniclust30
│   └── uniclust30_2018_08
│       ├── uniclust30_2018_08_a3m_db -> uniclust30_2018_08_a3m.ffdata
│       ├── uniclust30_2018_08_a3m_db.index
│       ├── uniclust30_2018_08_a3m.ffdata
│       ├── uniclust30_2018_08_a3m.ffindex
│       ├── uniclust30_2018_08.cs219
│       ├── uniclust30_2018_08_cs219.ffdata
│       ├── uniclust30_2018_08_cs219.ffindex
│       ├── uniclust30_2018_08.cs219.sizes
│       ├── uniclust30_2018_08_hhm_db -> uniclust30_2018_08_hhm.ffdata
│       ├── uniclust30_2018_08_hhm_db.index
│       ├── uniclust30_2018_08_hhm.ffdata
│       ├── uniclust30_2018_08_hhm.ffindex
│       └── uniclust30_2018_08_md5sum
└── uniref90
    └── uniref90.fasta

 

Verifying before running the image

Let's say we want to run alphafold2 from the directory /home/ASURITE/alphafold2

Please make sure the singularity image you built is also stored in this folder.
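Assuming the directory layout used above, the folder should contain at least the image and the input fasta file:

[asurite@agave ~]$ ls ~/alphafold2
alphafold.sif  input.fasta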

Creating a batch job to run alphafold

Alphafold2 launches a couple of multithreaded analyses using up to 8 CPUs before running model inference on the GPU. Memory requirements will vary with protein size. We created a batch input file for the above protein sequence as described below.

Create an sbatch file

Create a file named alpharun_jobscript.submit and paste the script shown below, updating it for your specific instance.

The -p and -q options will need to be updated in the top section of this script.

In the second half of the script, ALPHAFOLD_DATA_PATH will need to be changed to one of the database folders found under /data/alphafold in the previous step.

If you are using a different filename for your fasta file, be sure to update the FASTA_FILE variable.

If you used the directory structure suggested earlier in this document, nothing else will need to be changed; otherwise, search for and update any paths that differ.
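Below is a minimal sketch of what alpharun_jobscript.submit could look like, assembled from the notes above. The partition/QOS names, resource requests, template date, and the AlphaFold flags themselves vary between clusters and container versions, so treat every value here as an assumption to verify before submitting.

#!/bin/bash
#SBATCH -N 1
#SBATCH -c 8                      # alphafold's MSA/feature steps use up to 8 CPUs
#SBATCH --gres=gpu:1              # one GPU for model inference
#SBATCH --mem=64G                 # assumption: adjust for your protein size
#SBATCH -t 1-00:00                # assumption: adjust wall time as needed
#SBATCH -p <partition>            # update -p for your allocation
#SBATCH -q <qos>                  # update -q for your allocation

module load singularity/3.8.0

# Variables presumed most likely to change (see the notes above)
export ALPHAFOLD_DATA_PATH=/data/alphafold/db_20210825   # pick a completed db_YYYYmmdd directory
export FASTA_FILE=input.fasta                            # update if your fasta file is named differently

cd ~/alphafold2

# Options passed to singularity itself:
#   --nv                           make the node's GPUs usable inside the container
#   -B $ALPHAFOLD_DATA_PATH:/data  bind the reference databases at a consistent path (/data)
#   -B .:/etc                      bind a writable folder over /etc (the container writes small files there)
opts=( --nv -B "${ALPHAFOLD_DATA_PATH}":/data -B .:/etc )

# The AlphaFold arguments below are illustrative; newer AlphaFold releases replace --preset
# with --db_preset/--model_preset, and your container may also require explicit
# *_database_path flags pointing under /data. Check the container's documentation.
# Paths under ${PWD} rely on singularity's default home-directory bind.
singularity run "${opts[@]}" alphafold.sif \
  --fasta_paths="${PWD}/${FASTA_FILE}" \
  --data_dir=/data \
  --output_dir="${PWD}/output" \
  --max_template_date=2021-08-25 \
  --preset=full_dbs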

 

Reasoning behind the above sbatch script

For users looking to change or improve on this script, here is a brief description of various lines in the script. This is not all-inclusive, but it should clarify a few items.

  • export VARIABLE

    • These are the variables presumed most likely to change; exporting them at the top just makes it more convenient to adapt the script.

  • singularity run "${opts[@]}"

    • This tells singularity to run the provided image file with the array of flags supplied. For example, the --nv switch tells singularity to make any available GPUs usable inside the container, and -B XXXX lets singularity bind Agave directories into the container’s filesystem.

      • -B $ALPHAFOLD_DATA_PATH:/data maps the reference database to /data. This ensures that, inside the container, the path to the reference database is consistent.

      • -B .:/etc maps a folder in your home directory to /etc. The container needs /etc to be writable for several small files. If a non-writable path is used instead of the home directory, the container will not run.

  • --preset

    • This is part of the directions provided by alphafold for the test & verification fasta file. Consult with your peers about which value is correct for your run.

Running the batch script
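To submit the job (using the script name from above) and check on it while it runs:

[asurite@agave ~]$ cd ~/alphafold2
[asurite@agave ~]$ sbatch alpharun_jobscript.submit
[asurite@agave ~]$ squeue -u $USER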

On successful completion, the output directory should have the following files:
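The exact contents depend on the AlphaFold version, but for an AlphaFold 2.x run the per-target output folder typically looks something like this:

output/
├── features.pkl
├── msas/
├── ranked_0.pdb ... ranked_4.pdb
├── ranking_debug.json
├── relaxed_model_1.pdb ... relaxed_model_5.pdb
├── result_model_1.pkl ... result_model_5.pkl
├── timings.json
└── unrelaxed_model_1.pdb ... unrelaxed_model_5.pdb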

SLURM array run on multiple fasta files

There are many ways to do this. One implementation could have each fasta sequence on an individual line in a MASTER_LIST.txt file. In the sbatch script:
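For example, a hypothetical sketch of the lines to add near the top of the sbatch script, using SLURM_ARRAY_TASK_ID to pick one line of MASTER_LIST.txt per array task and write it out as that task's fasta file:

# Pick this array task's sequence (one sequence per line) from MASTER_LIST.txt
SEQUENCE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" MASTER_LIST.txt)

# Write it to a per-task fasta file (the name here is illustrative) and reuse it as FASTA_FILE
export FASTA_FILE="input_${SLURM_ARRAY_TASK_ID}.fasta"
printf ">seq_%s\n%s\n" "${SLURM_ARRAY_TASK_ID}" "${SEQUENCE}" > "${FASTA_FILE}"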

Then run sbatch with the --array flag. This example will submit 10 alphafold jobs on the first 10 lines of the MASTER_LIST.txt file:
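For example, assuming the job script name used earlier:

[asurite@agave ~]$ sbatch --array=1-10 alpharun_jobscript.submit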