BLAST identifies homologous sequences using a heuristic method that
first finds short matches between two sequences; the method therefore
does not take the entire sequence space into account. After finding these
initial matches, BLAST attempts to build local alignments from them. This
also means that BLAST does not guarantee the optimal alignment, so
some sequence hits may be missed. To find optimal alignments, the
Smith-Waterman algorithm should be used. In the following, the BLAST
algorithm is described in more detail.
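Before turning to BLAST itself, the contrast with an exhaustive method can
be made concrete. The sketch below computes an optimal local alignment
score with Smith-Waterman dynamic programming. It is a minimal
illustration, not BLAST code: the function name smith_waterman_score and
the match, mismatch and linear gap values are assumptions chosen for this
example. Every cell of the query-by-subject matrix is filled, which is the
exhaustive work the BLAST heuristic avoids.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    The scoring values are illustrative assumptions, and a simple linear
    gap penalty is used to keep the sketch short.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best
```

The running time grows with the product of the two sequence lengths, which
is why a database-scale search relies on a heuristic instead.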
Parameters used in the BLAST algorithm:
Threshold: A minimum or maximum cutoff value that is used to filter out
words during comparison.
True Homology: In BLAST, true homology refers to how similar a sequence
is to the query sequence.
E-value: The number of alignments with at least a given score that are
expected to occur by chance. It decreases exponentially with the score
that is assigned to an alignment between two sequences.
Word size: The search proceeds by taking words of a fixed size from the
query sequence, comparing them with the database sequence, and assigning
a score to each comparison. The word size is 11 for nucleic acids and 3 for
proteins.
Putative conserved domains : These are the domains that have different
Gap score or gap penalty: Dynamic programming algorithms use gap
penalties to keep alignments biologically meaningful. The gap score is a
penalty applied to the alignment for an insertion or deletion, and it is
subtracted for each gap that is introduced. During evolution, gaps may
occur in continuous runs along the sequence, so a linear gap penalty
would not be appropriate for the alignment. For this reason, separate gap
open and gap extension penalties were introduced for runs of continuous
gaps (five or more). The open penalty is always applied at the start of the
gap, and each gap position following it receives a gap extension penalty,
which is smaller than the open penalty. Typical values are -12 for gap
opening and -4 for gap extension.
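The E-value and gap penalty entries above can be illustrated with a little
arithmetic. The sketch below is an assumption-laden example rather than
BLAST's own code: the function names are hypothetical, the gap convention
follows the description above (open penalty for the first gap position,
extension penalty for each further one), and the E-value uses the
Karlin-Altschul form E = K * m * n * exp(-lambda * S), where K and lambda
depend on the scoring system and must be supplied by the caller.

```python
import math


def affine_gap_penalty(length, gap_open=-12, gap_extend=-4):
    """Penalty for one gap of `length` positions under the affine scheme.

    The first position costs gap_open; each further position costs
    gap_extend (the convention described in the text above).
    """
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


def linear_gap_penalty(length, per_gap=-12):
    """Penalty when every gap position costs the same amount."""
    return length * per_gap


def expected_hits(score, query_len, db_len, k, lam):
    """E-value: expected number of chance alignments scoring >= score."""
    return k * query_len * db_len * math.exp(-lam * score)


if __name__ == "__main__":
    for n in (1, 5, 10):
        print(n, affine_gap_penalty(n), linear_gap_penalty(n))
    # k and lam below are placeholder values for illustration only.
    for s in (30, 40, 50):
        print(s, expected_hits(s, query_len=300, db_len=10**6, k=0.1, lam=0.3))
```

With the typical values quoted above, a gap of five positions costs -28
under the affine scheme but -60 under a linear scheme at -12 per position,
which shows why the affine scheme is kinder to long continuous gaps.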
Working of the BLAST algorithm:
- The query sequence is taken and analyzed for low-complexity regions.
  Low-complexity regions are regions that contain little information or
  variation, like AAAAAAAA or ATATATAT.
- These low-complexity regions are masked with letters like X or N.
- A list of words of a certain word size is made. Usually the word size is
  3 for proteins and 11 for DNA.
- Scores are calculated for each pair of words (query sequence word
  and database word) using substitution scoring matrices (like PAM or
  BLOSUM), and only the high-scoring words, i.e. those above a threshold
  value or cutoff score, are taken for further alignment. A cutoff score
  is selected to reduce the number of matches so as to decrease
  the computation time.
- This scoring and checking is repeated for all the words in the query.
- The remaining high-scoring words are organised into an efficient search
  tree and rapidly compared to the database sequence. This is done to
  find the exact matches.
- If an exact or good match is found, an alignment is extended in
  both directions from the position where the match occurred.
- High-scoring segment pairs (HSPs) that have a score greater than a
  threshold are taken into consideration.
- The significance of each HSP score is calculated.
- Statistical assessments are made if two or more HSP regions are found,
  and the matching pairs are listed in the output file in descending
  order of similarity/score.
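
The seeding and extension steps above can be sketched in a few lines of
Python. This is a toy illustration under stated assumptions, not the
actual BLAST implementation: all function names are hypothetical, a
simple match/mismatch score stands in for PAM or BLOSUM, a dictionary
replaces the search tree, the extension is ungapped with a hypothetical
drop-off limit (x_drop), and masking and significance statistics are left
out.

```python
from itertools import product

MATCH, MISMATCH = 2, -1  # toy scores standing in for a PAM/BLOSUM matrix


def score_pair(x, y):
    return MATCH if x == y else MISMATCH


def neighborhood(word, alphabet, threshold):
    """All words over `alphabet` scoring >= threshold against `word`."""
    hits = []
    for cand in product(alphabet, repeat=len(word)):
        if sum(score_pair(a, b) for a, b in zip(word, cand)) >= threshold:
            hits.append("".join(cand))
    return hits


def build_lookup(query, w, alphabet, threshold):
    """Map each high-scoring word to the query offsets that produced it."""
    table = {}
    for i in range(len(query) - w + 1):
        for nb in neighborhood(query[i:i + w], alphabet, threshold):
            table.setdefault(nb, []).append(i)
    return table


def _extend_one_way(query, subject, qi, si, step, x_drop):
    """Walk from (qi, si) in direction `step`; return (best gain, length)."""
    gain = best_gain = 0
    length = best_len = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        gain += score_pair(query[qi], subject[si])
        length += 1
        if gain > best_gain:
            best_gain, best_len = gain, length
        elif best_gain - gain > x_drop:
            break  # score fell too far below the best seen so far
        qi += step
        si += step
    return best_gain, best_len


def extend_seed(query, subject, qi, si, w, x_drop=3):
    """Ungapped extension of a seed; returns (score, q_start, s_start, length)."""
    seed = sum(score_pair(query[qi + k], subject[si + k]) for k in range(w))
    r_gain, r_len = _extend_one_way(query, subject, qi + w, si + w, 1, x_drop)
    l_gain, l_len = _extend_one_way(query, subject, qi - 1, si - 1, -1, x_drop)
    return seed + l_gain + r_gain, qi - l_len, si - l_len, l_len + w + r_len


def find_hsps(query, subject, w=3, alphabet="ACGT", threshold=6,
              x_drop=3, min_score=8):
    """Seed with word hits, extend each hit, keep pairs above min_score."""
    table = build_lookup(query, w, alphabet, threshold)
    hsps = set()
    for si in range(len(subject) - w + 1):
        for qi in table.get(subject[si:si + w], ()):
            hsp = extend_seed(query, subject, qi, si, w, x_drop)
            if hsp[0] >= min_score:
                hsps.add(hsp)
    return sorted(hsps, reverse=True)  # highest score first
```

With these toy scores a threshold of 6 keeps only exact three-letter
words, while lowering it to 3 also admits words with a single mismatch.
Several seeds on the same diagonal extend to the same segment, so the
results are collected in a set before being sorted by score.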