makeblastdb creating multiple files of unexpectedly large sizes

I have a set of 100 amino acid sequences and I want to perform a BLASTP search against the refseq_protein database. Accordingly, I set up the standalone version of BLAST (version 2.11.0+) and downloaded the refseq_protein database from NCBI.

The database downloads as 3027 gzipped files of FASTA sequences. I unzipped all of these files and concatenated them into a single file, refseq_protein.faa (around 95 GB in size). Now when I run the following Python code:

from Bio.Blast.Applications import NcbimakeblastdbCommandline
from Bio.Blast.Applications import NcbiblastpCommandline

# Build the protein database (raw strings keep the Windows backslashes literal)
makeblastdb_cline = NcbimakeblastdbCommandline(dbtype="prot", input_file=r"D:\refseq_protein.faa", out="refseq_protein")
makeblastdb_cline()

# Run BLASTP against the freshly built database
blastp_cline = NcbiblastpCommandline(query=r"D:\DEP_sequences.fasta", db="refseq_protein", evalue=0.01, outfmt="7 sseqid evalue qcovs pident")
response = blastp_cline()
the NcbimakeblastdbCommandline call keeps creating multiple .phr, .pin, .psq, etc. files which take up a lot of space (in a demo run it had created ~30 GB of these files and was still running). I'm afraid this will exhaust the entire space available on my internal hard drive. Is there a way to estimate the total size of the files NcbimakeblastdbCommandline will create? That would help me decide whether or not to switch to external storage to perform the BLASTP search.
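As a rough way to watch the database's footprint grow (rather than predict it in advance), a small stdlib-only helper can total the sizes of the volume files makeblastdb has written so far and compare that with the free space on the drive. The directory path and the refseq_protein prefix below are assumptions matching the out name used above:

```python
import os
import shutil

def blastdb_size_gb(db_dir, prefix="refseq_protein"):
    """Sum the sizes (in GB) of all BLAST database volume files
    (.phr, .pin, .psq, ...) in db_dir that share the given base name."""
    total = 0
    with os.scandir(db_dir) as entries:
        for entry in entries:
            if entry.is_file() and entry.name.startswith(prefix):
                total += entry.stat().st_size
    return total / 1e9

# Example: compare what has been written so far with the free space left.
written = blastdb_size_gb(".")            # directory holding the database files
free = shutil.disk_usage(".").free / 1e9  # free space on that drive, in GB
print(f"db so far: {written:.1f} GB, free: {free:.1f} GB")
```

Running this periodically in a second terminal while makeblastdb works gives an early warning before the drive fills up.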

I am aware that a pre-formatted refseq_protein database exists, but I'm not sure what value to pass as the db parameter of NcbiblastpCommandline, since it expects the name of the database against which the BLASTP search is to be performed. In the approach I chose, I had the liberty to set the name of the database myself.
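For context on the pre-formatted route: my understanding is that after fetching the ready-made volumes with NCBI's update_blastdb.pl script, the db parameter is simply the base name refseq_protein, with the BLASTDB environment variable pointing at the directory holding the volumes. A sketch using the plain blastp binary (the D:\blastdb directory is a placeholder):

```python
import os
import subprocess

# Assumption: the pre-formatted volumes (refseq_protein.00.phr, .pin, .psq, ...)
# were downloaded into this directory with update_blastdb.pl.
os.environ["BLASTDB"] = r"D:\blastdb"

cmd = [
    "blastp",
    "-query", r"D:\DEP_sequences.fasta",
    "-db", "refseq_protein",  # base name only; BLASTDB locates the volume files
    "-evalue", "0.01",
    "-outfmt", "7 sseqid evalue qcovs pident",
]
# subprocess.run(cmd, check=True)  # uncomment once the volumes are in place
```

This avoids running makeblastdb entirely, so no extra formatted copies are created on the internal drive.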

Any suggestions on how to solve this issue would be appreciated.
