Hello people,
I got stucked with my new script and perhaps you can help me.
Its goal is to take an input table with querys and subjects (originated by a local blast) and replace query names with subject names in the corresponding fasta file.
In detail, the table input file is this one:
query | subject | evalue |
---|---|---|
TransABySS.k25.S12100 |
lclMK087647.1_cds_QHD46952.1_6_[gene=atp9][protein=Atp9][exception=RNA_editing]_[protein_id=QHD46 | 0 |
and the fasta file:
>TransABySS.k25.S121005 892 14708
TTGTGTGACGAACGGAAGCAGGTATACGGGCTGTGGGCTGCCATCCTGCTGTTCCACCTGGGAGGGCCTGACAACATCAGTGCCTACTCCATAGCAGATGCGGAGCTCTGGAAGAGGCATGGCTTTCAGATGTTCTTCCAGGTGGTCACCGCTTTCTATGTAGTGTTCTCCTCAGCACACGGCACCATCCTCTGGATCAGCCTGGTTCTGGCTGCAGTGGGAACGGTGAAGTACGCCGAGCGGACTCTGGCTCTGTACCATGCTTCCGAGCACCATCTGGACACCATGGCCCGCCCCCTTTACAGATTGATGCAGTACGAGGATGTGTCTGGCGGATCCGGTGAGGAATACCACTACTTCCTGCTAGGAGAGAAGCGAGCATACGAGCACGCCTTCGACACTCCGAAATGGTACCACCAGGTGAGGGAATCCTTGCAGGAGCAGAAGTGGTTCAACAGCATGTTCGGCTGCCTGTTTTCTTCCCCTGCGCAGCCAGTGGAGTCCGACGCGCAGGCAGTGCAACGCAAGATCAGAGAGCCCTTCGAGGTGGTGATGCTCTCAGACGTGATGGATCCAGAGTACAGCACAGCACTGAAGCGGCAGTCTCCATGGAACGGCAGGAACTTGGTGGACATCTGCATAGCCTTTGCTCTCTTCAAGATGTTCCGCCGACGCCTCACCAGCCTGTACATGCACGAGTGGAGCAACGACAAGATCAGAAATTTCTTCATCATGCTGTGGGGAGAATCATCATCGTCATCCTCAGCCGAAGCACAACACCAAGACCCCACCCAAGCACCTGCAGAACCTCCAAGGGAAAGACTGGTGGACGTCCTGGACATGGAGCTCCGCTTCATGTTCGACAGCATGTTCACCAAGGCCTCAGGCACAG
The expected output should be:
>lcl|MK087647.1_cds_QHD46952.1_6_[gene=atp9]_[protein=Atp9]_[exception=RNA_editing]_[protein_id=QHD46
TTGTGTGACGAACGGAAGCAGGTATACGGGCTGTGGGCTGCCATCCTGCTGTTCCACCTGGGAGGGCCTGACAACATCAGTGCCTACTCCATAGCAGATGCGGAGCTCTGGAAGAGGCATGGCTTTCAGATGTTCTTCCAGGTGGTCACCGCTTTCTATGTAGTGTTCTCCTCAGCACACGGCACCATCCTCTGGATCAGCCTGGTTCTGGCTGCAGTGGGAACGGTGAAGTACGCCGAGCGGACTCTGGCTCTGTACCATGCTTCCGAGCACCATCTGGACACCATGGCCCGCCCCCTTTACAGATTGATGCAGTACGAGGATGTGTCTGGCGGATCCGGTGAGGAATACCACTACTTCCTGCTAGGAGAGAAGCGAGCATACGAGCACGCCTTCGACACTCCGAAATGGTACCACCAGGTGAGGGAATCCTTGCAGGAGCAGAAGTGGTTCAACAGCATGTTCGGCTGCCTGTTTTCTTCCCCTGCGCAGCCAGTGGAGTCCGACGCGCAGGCAGTGCAACGCAAGATCAGAGAGCCCTTCGAGGTGGTGATGCTCTCAGACGTGATGGATCCAGAGTACAGCACAGCACTGAAGCGGCAGTCTCCATGGAACGGCAGGAACTTGGTGGACATCTGCATAGCCTTTGCTCTCTTCAAGATGTTCCGCCGACGCCTCACCAGCCTGTACATGCACGAGTGGAGCAACGACAAGATCAGAAATTTCTTCATCATGCTGTGGGGAGAATCATCATCGTCATCCTCAGCCGAAGCACAACACCAAGACCCCACCCAAGCACCTGCAGAACCTCCAAGGGAAAGACTGGTGGACGTCCTGGACATGGAGCTCCGCTTCATGTTCGACAGCATGTTCACCAAGGCCTCAGGCACAG
Now, I already have a code written in python which does it but it is too slow. I post it here below in case anyone of you will prefer it
import pandas as pd from Bio import SeqIO
chloro_table=pd.read_csv(./db.chloro.agrestis.blastout’, sep=’t’, header=None, names=[‘qseqid’,’sseqid’, ‘evalue’])
with open (‘./new_names_chloro.fasta’, ‘a’) as for sequence in SeqIO.parse(./fasta.fa’, ‘fasta’): for i in range(0,len(chloro_table)): if chloro_table.iloc[i,0] == sequence.id: sequence.id = chloro_table.iloc[i,3] sequence.description=” SeqIO.write(sequence, outfile, ‘fasta’)
Now, using bash it becomes much more complicated. I tried with awk but with scarce success. Do you have any ide?
Thank you in advance
Read more here: Source link