Blast locally on multiple fasta files with multiple databases

Hi all,

I need to run blast locally on multiple fasta files contained in a directory. Previously, I have used the command below:

ls *.fasta | parallel -a - blastp -query {} -db my_database -evalue 0.00001 -qcov_hsp_perc 50 -outfmt 6 -max_target_seqs 1 -out {.}.tsv

Since I need to do pairwise comparisons and blast all fasta files against each other, the command above would have to be run multiple times, because I have to specify the database each time. Is it possible to have a command that blasts all fasta files against each other and generates output files whose names combine the database name and the query file name?

Thank you.


Something like this:

#!/bin/bash
set -euo pipefail

for q in *.fasta; do
  for s in *.fasta; do
    # Ignore self-comparison
    if [[ "${q}" != "${s}" ]]; then
      # Before blastp, check that a database for ${s} exists, or create it (I leave this to you)
      # Output name: query and database names without the .fasta extension
      blastp -query "${q}" -db "${s}" -evalue 0.00001 -qcov_hsp_perc 50 -outfmt 6 -max_target_seqs 1 -out "${q%.fasta}_2_${s%.fasta}.tsv"
    fi
  done
done
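The database check left open in the loop above could be sketched roughly as follows. This is only an illustration under some assumptions: the fasta files contain protein sequences, each database is named exactly like its fasta file (so -db ${s} keeps working), and the presence of a "<name>.pin" index file is taken as a sign that the database was already built. The function name ensure_db and the MAKEBLASTDB variable are made up for this example.

```shell
#!/bin/bash
# Sketch: build a protein BLAST database for a fasta file unless one exists.
# Assumption: the database is named like the fasta file itself, and a
# "<fasta>.pin" index file means the database has already been created.
MAKEBLASTDB="${MAKEBLASTDB:-makeblastdb}"   # hypothetical override hook

ensure_db() {
  fasta="$1"
  if [ ! -e "${fasta}.pin" ]; then
    "${MAKEBLASTDB}" -in "${fasta}" -dbtype prot -out "${fasta}"
  fi
}
```

Calling ensure_db "${s}" on the line before blastp inside the inner loop should then be enough. For nucleotide data the database type and the index file extension would differ.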

2.3 years ago by AK

I will borrow SMK's loop above to provide a GNU parallel solution.

First, save the all-vs-all filename pairs into a file:

for q in *.fasta; do
  for s in *.fasta; do
    # Ignore self-comparison
    if [[ "${q}" != "${s}" ]]; then
      echo "${q} ${s}"
    fi
  done
done > allpairs.txt   # overwrite, so re-running does not duplicate pairs

Then cat the filenames into parallel to build the multiple blast commands:

cat allpairs.txt | parallel --colsep ' ' -j 1 \
    'blastp -query {1} -subject {2} -evalue 0.00001 -qcov_hsp_perc 50 -outfmt 6 -max_target_seqs 1 -out {1.}_vs_{2.}.tsv'

If you have multiple CPU cores, you can use -j n to run n blast searches simultaneously. You don’t need to build blast databases: blast can compare two fasta files directly with -query file1 -subject file2. If the fasta files are large, however, it is better to build the databases beforehand and replace -subject {2} with -db {2.} (this assumes each database is named after its fasta file without the extension).
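For the database variant, one possible approach is to print the commands first and only then hand them to GNU parallel. The sketch below assumes protein fasta files ending in .fasta, no whitespace in the file names, and databases named after the fasta file without its extension (matching -db {2.}). The function names db_commands and blast_commands are made up for this example.

```shell
#!/bin/bash
# Sketch only: print the commands instead of running them, so they can be
# inspected first. Assumptions: protein fasta files ending in .fasta, no
# whitespace in file names, databases named after the file without extension.

# One makeblastdb command per fasta file.
db_commands() {
  for s in "$@"; do
    echo "makeblastdb -in ${s} -dbtype prot -out ${s%.fasta}"
  done
}

# One blastp command per ordered pair of distinct files.
blast_commands() {
  for q in "$@"; do
    for s in "$@"; do
      if [ "${q}" != "${s}" ]; then
        echo "blastp -query ${q} -db ${s%.fasta} -evalue 0.00001 -qcov_hsp_perc 50 -outfmt 6 -max_target_seqs 1 -out ${q%.fasta}_vs_${s%.fasta}.tsv"
      fi
    done
  done
}
```

Something like db_commands *.fasta | parallel -j 4 followed by blast_commands *.fasta | parallel -j 4 would then run them, since GNU parallel executes each input line as a command when no command is given. The database step has to finish before the searches start, which is why the two lists are kept separate.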

edits / comments:

  • in view of my comment above, you should check whether -subject and -db really behave the same way.
  • with older blast versions, multithreading did not scale well, and it was faster to run n searches in parallel than to use -num_threads n. This may have improved in recent blast versions (current is BLAST+ 2.9.0), but I didn’t test.
  • keep in mind that if you run several blast searches simultaneously with GNU parallel, memory usage will increase correspondingly compared to a single blast search.

