Inferring homology from BLAST scores/statistics


I have some proteins with BLAST homologs, and I am trying to get a quantitative measure of how representative each match is. As I understand it, everyone normally compares BLAST alignments using bit scores, as these are independent of database size. However (please correct me if I’m wrong), bit scores only describe the quality of the HSP itself, not how representative that HSP/bit score is of its parent protein.

Would I be barking up the wrong tree if I DIYed a score for comparison? One of the main reasons I’m asking is I’m not particularly hot on BLAST statistics (so this may all be unnecessary) and I know making up your own stats can be a bad idea.

My score would be something like:

(bitscore * hit_perc_coverage) / log (evalue)

This would hopefully approximate to:

(quality of HSP * HSP representativeness of protein) / reliability of HSP quality

NB – I would take the log to stop differences in E-value massively biasing the final score.
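
As a rough illustration, the score above could be computed from BLAST tabular output along these lines. This is only a sketch: it assumes hit_perc_coverage means the percentage of the hit sequence spanned by the HSP (taken from the sstart, send and slen columns), and that "log" means log base 10. The function names are made up for the example.

```python
import math


def hit_perc_coverage(sstart, send, slen):
    """Percentage of the hit (subject) sequence spanned by one HSP.

    Assumes 1-based, inclusive coordinates as in BLAST tabular output.
    """
    span = abs(send - sstart) + 1
    return 100.0 * span / slen


def diy_score(bitscore, coverage, evalue):
    """(bitscore * hit_perc_coverage) / log(evalue), with log assumed base 10.

    Raises ValueError for an E-value of 0.0 (log undefined) and
    ZeroDivisionError for an E-value of exactly 1 (log10 is 0).
    """
    return (bitscore * coverage) / math.log10(evalue)


# Hypothetical HSP: residues 11-90 of a 100-residue hit, bit score 200.
coverage = hit_perc_coverage(sstart=11, send=90, slen=100)
score = diy_score(bitscore=200.0, coverage=coverage, evalue=1e-50)
print(coverage, score)
```

Note that for any E-value below 1 the log is negative, so this version of the score comes out negative.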

Thanks for reading!


Having thought about it a bit more, dividing by the log of the e-value shrinks the final score as the exponent of the e-value gets larger in absolute terms (i.e. as the e-value gets smaller, assuming it is below 1). I think this would be bad, as smaller/better e-values would give smaller final scores than larger/worse e-values, whereas for the coverage and bit score terms higher is better. On top of that, the log of an e-value below 1 is negative, so the log term needs its sign flipped for a higher final score to mean a better hit. This means a better (but potentially still crap!) score would likely be something more like:

bitscore * hit_perc_coverage * -log (evalue)
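
A minimal sketch of this multiplicative version, for comparison. It assumes log base 10 and uses the negated log, so that smaller e-values push the score up rather than down. The floor for e-values reported as 0.0 is an arbitrary choice for the example, and the hit names and numbers are made up.

```python
import math

# Assumed floor so that an E-value reported as 0.0 can still be logged.
EVALUE_FLOOR = 1e-180


def revised_score(bitscore, hit_perc_coverage, evalue):
    """bitscore * hit_perc_coverage * -log10(evalue).

    The log is negated because log10 of an E-value below 1 is negative.
    E-values above 1 still give a negative score.
    """
    neg_log_e = -math.log10(max(evalue, EVALUE_FLOOR))
    return bitscore * hit_perc_coverage * neg_log_e


# Hypothetical hits: (name, bit score, hit coverage in percent, E-value).
hits = [
    ("hitA", 200.0, 80.0, 1e-50),
    ("hitB", 200.0, 80.0, 1e-5),
    ("hitC", 50.0, 20.0, 0.0),
]

# Rank so that the highest score comes first.
ranked = sorted(hits, key=lambda h: revised_score(*h[1:]), reverse=True)
for name, bitscore, coverage, evalue in ranked:
    print(name, revised_score(bitscore, coverage, evalue))
```

With identical bit score and coverage, the hit with the smaller e-value now ranks higher.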
