Protein filtering for annotation

I downloaded 1009 proteins from GenBank. After the filtering below, I end up with 663 amino acid sequences:

$ grep ">" NbenthamianaGenbankAA.fasta | grep -v partial | grep -v like | grep -v unnamed | wc -l
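Note that the pipeline above only counts the matching headers. If the filtered records themselves are needed, one possible sketch is an awk filter that applies the same three keywords and prints each kept header together with its sequence lines. The function name `filter_fasta` and the output file name are made up for illustration, and the keyword match is a plain substring match, exactly like the grep version:

```shell
# Print complete FASTA records whose header does not contain
# "partial", "like", or "unnamed" (same keywords as the grep pipeline).
filter_fasta() {
    awk '
        /^>/ { keep = ($0 !~ /partial|like|unnamed/) }  # decide once per header
        keep                                            # print header and sequence lines of kept records
    ' "$1"
}

# Hypothetical usage:
#   filter_fasta NbenthamianaGenbankAA.fasta > NbenthamianaGenbankAA.filtered.fasta
```

Because "like" is matched anywhere in the header, this (and the original grep) would also drop a description that merely contains that string inside another word.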

However, I noticed many identical descriptions but with different IDs and sequence lengths.

How do I choose which protein to keep among entries like these, and are there any better filtering steps?

Thank you in advance,





Sequences with identical descriptions but different lengths are likely orthologs from different species.

You can filter them by sequence identity using CD-HIT. With a 90% identity cutoff, it clusters together all sequences that share at least 90% identity and keeps a single representative (the longest sequence) from each cluster. That will likely take care of your partial sequences as well, without having to grep them out.

It is up to you to select whatever identity threshold makes most sense. More sequences will be retained at higher cutoffs.
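As a rough sketch of what that could look like on the command line, assuming `cd-hit` is installed and on your PATH: `-i` is the input FASTA, `-o` the output FASTA of representatives, and `-c` the identity threshold as a fraction. The wrapper name `cluster_proteins` and the output file name are only placeholders:

```shell
# Cluster proteins with CD-HIT and keep one representative per cluster.
# $1 = input FASTA, $2 = output FASTA, $3 = identity threshold (defaults to 0.90)
cluster_proteins() {
    cd-hit -i "$1" -o "$2" -c "${3:-0.90}"
}

# Hypothetical usage with a 90% cutoff:
#   cluster_proteins NbenthamianaGenbankAA.fasta Nbenthamiana_cdhit90.fasta 0.90
```

Alongside the output FASTA, CD-HIT writes a `.clstr` file listing the members of each cluster, which is handy for checking which of the identically described entries were merged.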
