Remove duplicates in fasta files based on a specific value with awk

I have a FASTA file organized as such:

>Prevalence_Sequence_ID:13|ARO_Name:AxyX|ARO:3004143|Detection_Model:Protein Homolog Model
ATGAAGCAAAGAGTCCCTCTACGCACGTTCGTCCTATCTGCCGTATTAATTCTTATTACTGGTTGCTCGAAACCGGAAACCCAACCAGCCGCCGACGCCCCGGCGGAGAT
>Prevalence_Sequence_ID:14|ARO_Name:adeF|ARO:3004143|Detection_Model:Protein Homolog Model
ATGAATATCTCGAAATTCTTCATCGACCGGCCGATCTTCGCCGGCGTGCTTTCGATCCTGGTGTTGCTGGCGGGCATACTGGCCATGTTCCAGCTGCCCATTTCCGAGTACCCGGAAGTGGTGCCGCCGTCGGTGGTGGTGCGCGCGCAGTATCCGGGCGCCAACCCCAAGGTCATCGCCGAAACCGTGGCCTCGCCGCTGGAGGAG

I need to remove sequences that share the same ARO code (such as the two above, which both carry ARO:3004143), keeping only one record per code.
Is there a simple way to do this with awk? Alternatively, I can use Python.
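Since Python is acceptable, one possible approach is sketched below. It makes several assumptions that the question does not confirm: the ARO code is taken to be the digits after `|ARO:` in the header, the first record seen for each code is the one kept, a sequence may span several lines, and a header without an ARO field is kept and compared on its full text. The names `read_fasta`, `dedupe_by_aro` and `main` are made up for this illustration.

```python
import re

# Matches the numeric code in "|ARO:3004143|". The leading pipe keeps
# "ARO_Name:..." from matching. Assumes this field is always numeric.
ARO_PATTERN = re.compile(r"\|ARO:(\d+)")


def read_fasta(lines):
    """Yield (header, sequence_lines) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, seq
            header, seq = line, []
        elif header is not None and line:
            # Sequence line belonging to the current record; blank lines are skipped.
            seq.append(line)
    if header is not None:
        yield header, seq


def dedupe_by_aro(lines):
    """Yield output lines, keeping only the first record for each ARO code."""
    seen = set()
    for header, seq in read_fasta(lines):
        match = ARO_PATTERN.search(header)
        # Assumption: records without an ARO field are keyed on the whole header.
        key = match.group(1) if match else header
        if key in seen:
            continue
        seen.add(key)
        yield header
        yield from seq


def main(in_path, out_path):
    """Read in_path and write the deduplicated records to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in dedupe_by_aro(src):
            dst.write(line + "\n")
```

Calling `main("input.fasta", "deduplicated.fasta")` (both file names are placeholders) would write the filtered file. Because only the set of seen codes is held in memory, the input is streamed and does not need to be sorted first.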


Tags: sort, fasta, sed, awk
