Replace fasta header using bash : bioinformatics

Hello people,

I got stucked with my new script and perhaps you can help me.

Its goal is to take an input table with querys and subjects (originated by a local blast) and replace query names with subject names in the corresponding fasta file.

In detail, the table input file is this one:

query subject evalue
TransABySS.k25.S12100 lclMK087647.1_cds_QHD46952.1_6_[gene=atp9][protein=Atp9][exception=RNA_editing]_[protein_id=QHD46 0

and the fasta file:

>TransABySS.k25.S121005 892 14708

TTGTGTGACGAACGGAAGCAGGTATACGGGCTGTGGGCTGCCATCCTGCTGTTCCACCTGGGAGGGCCTGACAACATCAGTGCCTACTCCATAGCAGATGCGGAGCTCTGGAAGAGGCATGGCTTTCAGATGTTCTTCCAGGTGGTCACCGCTTTCTATGTAGTGTTCTCCTCAGCACACGGCACCATCCTCTGGATCAGCCTGGTTCTGGCTGCAGTGGGAACGGTGAAGTACGCCGAGCGGACTCTGGCTCTGTACCATGCTTCCGAGCACCATCTGGACACCATGGCCCGCCCCCTTTACAGATTGATGCAGTACGAGGATGTGTCTGGCGGATCCGGTGAGGAATACCACTACTTCCTGCTAGGAGAGAAGCGAGCATACGAGCACGCCTTCGACACTCCGAAATGGTACCACCAGGTGAGGGAATCCTTGCAGGAGCAGAAGTGGTTCAACAGCATGTTCGGCTGCCTGTTTTCTTCCCCTGCGCAGCCAGTGGAGTCCGACGCGCAGGCAGTGCAACGCAAGATCAGAGAGCCCTTCGAGGTGGTGATGCTCTCAGACGTGATGGATCCAGAGTACAGCACAGCACTGAAGCGGCAGTCTCCATGGAACGGCAGGAACTTGGTGGACATCTGCATAGCCTTTGCTCTCTTCAAGATGTTCCGCCGACGCCTCACCAGCCTGTACATGCACGAGTGGAGCAACGACAAGATCAGAAATTTCTTCATCATGCTGTGGGGAGAATCATCATCGTCATCCTCAGCCGAAGCACAACACCAAGACCCCACCCAAGCACCTGCAGAACCTCCAAGGGAAAGACTGGTGGACGTCCTGGACATGGAGCTCCGCTTCATGTTCGACAGCATGTTCACCAAGGCCTCAGGCACAG

The expected output should be:

>lcl|MK087647.1_cds_QHD46952.1_6_[gene=atp9]_[protein=Atp9]_[exception=RNA_editing]_[protein_id=QHD46

TTGTGTGACGAACGGAAGCAGGTATACGGGCTGTGGGCTGCCATCCTGCTGTTCCACCTGGGAGGGCCTGACAACATCAGTGCCTACTCCATAGCAGATGCGGAGCTCTGGAAGAGGCATGGCTTTCAGATGTTCTTCCAGGTGGTCACCGCTTTCTATGTAGTGTTCTCCTCAGCACACGGCACCATCCTCTGGATCAGCCTGGTTCTGGCTGCAGTGGGAACGGTGAAGTACGCCGAGCGGACTCTGGCTCTGTACCATGCTTCCGAGCACCATCTGGACACCATGGCCCGCCCCCTTTACAGATTGATGCAGTACGAGGATGTGTCTGGCGGATCCGGTGAGGAATACCACTACTTCCTGCTAGGAGAGAAGCGAGCATACGAGCACGCCTTCGACACTCCGAAATGGTACCACCAGGTGAGGGAATCCTTGCAGGAGCAGAAGTGGTTCAACAGCATGTTCGGCTGCCTGTTTTCTTCCCCTGCGCAGCCAGTGGAGTCCGACGCGCAGGCAGTGCAACGCAAGATCAGAGAGCCCTTCGAGGTGGTGATGCTCTCAGACGTGATGGATCCAGAGTACAGCACAGCACTGAAGCGGCAGTCTCCATGGAACGGCAGGAACTTGGTGGACATCTGCATAGCCTTTGCTCTCTTCAAGATGTTCCGCCGACGCCTCACCAGCCTGTACATGCACGAGTGGAGCAACGACAAGATCAGAAATTTCTTCATCATGCTGTGGGGAGAATCATCATCGTCATCCTCAGCCGAAGCACAACACCAAGACCCCACCCAAGCACCTGCAGAACCTCCAAGGGAAAGACTGGTGGACGTCCTGGACATGGAGCTCCGCTTCATGTTCGACAGCATGTTCACCAAGGCCTCAGGCACAG

Now, I already have a code written in python which does it but it is too slow. I post it here below in case anyone of you will prefer it

import pandas as pd from Bio import SeqIO

chloro_table=pd.read_csv(./db.chloro.agrestis.blastout’, sep=’t’, header=None, names=[‘qseqid’,’sseqid’, ‘evalue’])

with open (‘./new_names_chloro.fasta’, ‘a’) as for sequence in SeqIO.parse(./fasta.fa’, ‘fasta’): for i in range(0,len(chloro_table)): if chloro_table.iloc[i,0] == sequence.id: sequence.id = chloro_table.iloc[i,3] sequence.description=” SeqIO.write(sequence, outfile, ‘fasta’)

Now, using bash it becomes much more complicated. I tried with awk but with scarce success. Do you have any ide?

Thank you in advance

Read more here: Source link