Remove duplicates in FASTA files based on a specific value with awk
I have a FASTA file organized as such:
>Prevalence_Sequence_ID:13|ARO_Name:AxyX|ARO:3004143|Detection_Model:Protein Homolog Model
ATGAAGCAAAGAGTCCCTCTACGCACGTTCGTCCTATCTGCCGTATTAATTCTTATTACTGGTTGCTCGAAACCGGAAACCCAACCAGCCGCCGACGCCCCGGCGGAGAT
>Prevalence_Sequence_ID:14|ARO_Name:adeF|ARO:3004143|Detection_Model:Protein Homolog Model
ATGAATATCTCGAAATTCTTCATCGACCGGCCGATCTTCGCCGGCGTGCTTTCGATCCTGGTGTTGCTGGCGGGCATACTGGCCATGTTCCAGCTGCCCATTTCCGAGTACCCGGAAGTGGTGCCGCCGTCGGTGGTGGTGCGCGCGCAGTATCCGGGCGCCAACCCCAAGGTCATCGCCGAAACCGTGGCCTCGCCGCTGGAGGAG
I need to remove sequences that share the same ARO code (such as the two above), keeping only one record per code.
Is there a simple solution to this problem using awk? Alternatively, I can use Python.
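For the Python route, here is a minimal sketch of one way this could be done. It assumes each header contains a field of the form `|ARO:<digits>`, as in the example above, and that keeping the first record seen for each ARO code is acceptable. The function name `dedupe_by_aro` and the file names in the usage comment are placeholders, not part of any existing tool.

```python
import re
from typing import Iterable, Iterator

# Matches the "|ARO:3004143" field; "|ARO_Name:..." does not match
# because the pattern requires a colon directly after "ARO".
ARO_FIELD = re.compile(r"\|ARO:(\d+)")


def dedupe_by_aro(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines, keeping only the first record per ARO code.

    Assumption: headers without an ARO field are deduplicated on the
    whole header line instead. Lines before the first header are dropped.
    """
    seen = set()
    keep = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            match = ARO_FIELD.search(line)
            key = match.group(1) if match else line
            keep = key not in seen  # first occurrence of this code?
            seen.add(key)
        if keep:
            yield line  # header or sequence line of a kept record


# Hypothetical usage (file names are placeholders):
# with open("input.fasta") as src, open("dedup.fasta", "w") as dst:
#     for line in dedupe_by_aro(src):
#         dst.write(line + "\n")
```

Because the keep/skip decision is made on the header and then applied to every following line, this should also work when sequences are wrapped over several lines.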