Running BLAST locally on multiple FASTA files against multiple databases
Hi all,
I need to run BLAST locally on multiple FASTA files contained in a directory. Previously, I used the command below:
ls *.fasta | parallel -a - blastp -query {} -db my_database -evalue 0.00001 -qcov_hsp_perc 50 -outfmt 6 -max_target_seqs 1 -out {.}.tsv
I need to do pairwise comparisons, blasting all FASTA files against each other. With the command above I would have to run it multiple times, because the database has to be specified each time. Is there a command that can blast all FASTA files against each other, and name each output file by combining the database name with the query file name?
Thank you.
Something like this:
    #!/bin/bash
    set -euo pipefail
    for q in *.fasta; do
        for s in *.fasta; do
            # Ignore self-comparison
            if [[ "${q}" != "${s}" ]]; then
                # Before blastp, check that a database for ${s} exists, or create it (I leave this to you)
                # Variables are quoted so that file names with spaces do not break the command
                blastp -query "${q}" -db "${s}" -evalue 0.00001 -qcov_hsp_perc 50 -outfmt 6 -max_target_seqs 1 -out "${q}_2_${s}.tsv"
            fi
        done
    done
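The database step left open in the loop above could be sketched as follows. This is only a minimal sketch under stated assumptions: `ensure_db` is a hypothetical helper name, the databases are assumed to be single-volume protein databases (hence `-dbtype prot` and the check for a `.pin` index file), and each database is named after its FASTA file without the `.fasta` extension rather than after the full file name used in the loop.

```shell
# Hypothetical helper: make sure a protein BLAST database exists for a FASTA file.
# Prints the database prefix (FASTA name without .fasta) for use with -db.
ensure_db() {
    local fasta="$1"
    local db="${fasta%.fasta}"
    # Assumption: a single-volume protein database has a <prefix>.pin index file
    if [ ! -e "${db}.pin" ]; then
        makeblastdb -in "${fasta}" -dbtype prot -out "${db}" >/dev/null
    fi
    echo "${db}"
}

# Possible use inside the inner loop (not run here):
#   db="$(ensure_db "${s}")"
#   blastp -query "${q}" -db "${db}" ... -out "${q}_2_${s}.tsv"
```

Building each database once and reusing it avoids repeating the work for every query file; whether that pays off depends on how large the FASTA files are.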
I will borrow SMK's loop above to provide a GNU parallel solution.
First, save the all-vs-all filename pairs into a file:
    for q in *.fasta; do
        for s in *.fasta; do
            # Ignore self-comparison
            if [[ "${q}" != "${s}" ]]; then
                echo "${q} ${s}" >> allpairs.txt
            fi
        done
    done
Then cat the filename pairs into parallel to build the blast commands:
    cat allpairs.txt | parallel --colsep ' ' -j 1 \
        'blastp -query {1} -subject {2} -evalue 0.00001 -qcov_hsp_perc 50 -outfmt 6 -max_target_seqs 1 -out {1.}_vs_{2.}.tsv'
If you have multiple CPU cores, you can use -j n to perform n blast searches simultaneously. You don’t need to build blast databases: blast can perform a fasta vs fasta comparison with -query file1 -subject file2. However, if each fasta file is large, it would be best to build the databases beforehand and replace -subject {2} with -db {2.}.
edits / comments:
- in view of my comment above, you should check whether -subject and -db indeed behave the same way.
- with older blast versions, thread scaling wasn’t very good and it was faster to perform n parallel searches than to use -num_threads n. This may have improved in recent blast versions (current is BLAST+ 2.9.0), but I didn’t test.
- keep in mind that if you run several blast searches simultaneously with GNU parallel, memory usage will increase correspondingly compared to just one blast search.
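One way to balance the two kinds of parallelism mentioned in the comments is to fix the number of threads per search and derive the number of simultaneous jobs from the core count. The sketch below is only an illustration: `jobs_for` is a hypothetical helper name, the core and thread counts are made-up values, and the commented example uses -db because, as far as I know, -num_threads is ignored when -subject is used (worth checking with your BLAST version).

```shell
# Hypothetical helper: number of simultaneous jobs that fit on the machine
# when each blastp process gets a fixed number of threads.
jobs_for() {
    local cores="$1"
    local threads="$2"
    local n=$(( cores / threads ))
    # Always allow at least one job
    [ "$n" -ge 1 ] || n=1
    echo "$n"
}

# Example (not run here): 8 cores, 2 threads per search
#   cat allpairs.txt | parallel --colsep ' ' -j "$(jobs_for 8 2)" \
#       'blastp -query {1} -db {2.} -num_threads 2 -outfmt 6 -out {1.}_vs_{2.}.tsv'
```

Fewer jobs with more threads each should keep memory usage lower, since fewer searches run at once; which split is faster would have to be timed on your own data.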