gff3 – Extracting amino acid and nucleotide sequences from KofamScan output and codon alignment

I want to extract the amino acid sequences for the hits in my KofamScan output. My workflow is shown in the attached picture:

For my analysis, I need to get the amino acid sequences, align them, and then do a codon alignment with the corresponding nucleotide sequences, so that I end up with aligned nucleotides.
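
To make the codon alignment step concrete, here is a minimal sketch of the idea: each residue in the aligned protein is replaced by its codon from the unaligned nucleotide sequence, and each gap becomes three gap characters. The function name `codon_align` and the toy sequences are made up for illustration; this is not the tool used in the workflow, and it assumes the nucleotide sequence is the coding strand, in frame, with at most one trailing stop codon.

```python
def codon_align(aligned_protein: str, cds: str) -> str:
    """Thread the codons of `cds` onto a gapped protein sequence."""
    n_residues = sum(1 for ch in aligned_protein if ch != "-")
    # Drop a trailing stop codon if the CDS carries one the protein lacks.
    if len(cds) == 3 * (n_residues + 1):
        cds = cds[:-3]
    if len(cds) != 3 * n_residues:
        raise ValueError("CDS length does not match the protein length")

    codons = []
    pos = 0
    for ch in aligned_protein:
        if ch == "-":
            codons.append("---")          # one protein gap = three nucleotide gaps
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


# Toy example (hypothetical sequences): protein "MKV" aligned as "M-KV".
print(codon_align("M-KV", "ATGAAAGTTTAA"))  # ATG---AAAGTT
```

The length check is useful here: if the extracted nucleotide sequence is not exactly three times the protein length (plus an optional stop codon), the extraction is probably off before any alignment is attempted.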

Therefore, for each copy of the gene, I need both the nucleotide sequence and the aa sequence. I am extracting these sequences using the .gff file of the genome in which the gene is found. Each row of the .gff file contains a distinct combination of an ID from the .fna file and an ID from the .faa file, together with the start and end positions of the target gene, which I use to extract the sequence from the .fna file (as shown in the 4th step of the workflow).

  • e.g. NZ_AJQO01000119.1 is the ID from the .fna file for nucleotides, and
    FIBHGADE03314 is the ID for aa in the .faa file. They are on the same
    line in the .gff file, and in that line, the start and end positions
    on NZ_AJQO01000119.1 are 60 and 1025 respectively (as shown in the last step of the workflow).
    Therefore, I extract the nucleotide sequence based on the start and
    end positions, and do the codon alignment with the aligned aa sequence
    retrieved using the ID FIBHGADE03314.
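
The extraction step described above can be sketched as follows, assuming the file follows the GFF3 convention: columns 4 and 5 (start and end) are 1-based and both inclusive, and column 7 gives the strand, so a feature on the "-" strand has to be reverse-complemented after slicing. Under that convention the example coordinates give 1025 - 60 + 1 = 966 nucleotides, which is a multiple of 3. The function names, the "." source column, and the "+" strand in the example line are hypothetical; only the two IDs and the coordinates come from the example above.

```python
# Complement table for the reverse strand (keeps N and lower case).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def parse_gff_line(line: str):
    """Return (seqid, start, end, strand) from one tab-separated GFF3 row."""
    cols = line.rstrip("\n").split("\t")
    return cols[0], int(cols[3]), int(cols[4]), cols[6]


def extract_feature(contig_seq: str, start: int, end: int, strand: str) -> str:
    """Slice a feature out of its contig using GFF3 coordinates."""
    # 1-based inclusive [start, end] -> Python's 0-based half-open [start-1, end)
    sub = contig_seq[start - 1:end]
    if strand == "-":
        sub = sub.translate(_COMPLEMENT)[::-1]  # reverse complement
    return sub


# Hypothetical row: only the IDs and coordinates are taken from the example.
row = "NZ_AJQO01000119.1\t.\tCDS\t60\t1025\t.\t+\t0\tID=FIBHGADE03314"
seqid, start, end, strand = parse_gff_line(row)
print(seqid, end - start + 1)  # NZ_AJQO01000119.1 966

# Toy contig: the gene occupies positions 3..14 (1-based, inclusive).
print(extract_feature("CCATGAAAGTTTAAGG", 3, 14, "+"))  # ATGAAAGTTTAA
```

Two common sources of wrong output with this kind of extraction are an off-by-one slice (for example `contig_seq[start:end]`) and ignoring the strand column.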

However, the extracted nucleotide sequences do not seem correct, while the aa sequences look good.
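
One way to test whether an extracted nucleotide sequence is correct is to translate it and compare the result with the aa sequence from the .faa file. The sketch below is only an illustration under stated assumptions: it uses the standard genetic code (bacterial table 11 assigns the same amino acids and differs only in the permitted start codons), it ignores the first residue because alternative start codons such as GTG or TTG are still written as M in the protein, and the names `translate` and `matches_protein` are made up.

```python
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G).
_BASES = "TCAG"
_AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}


def translate(cds: str) -> str:
    """Translate in frame 1; unknown codons become X, partial codons are dropped."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def matches_protein(cds: str, protein: str) -> bool:
    """Compare the translation with the protein, ignoring stop and first residue."""
    translated = translate(cds).rstrip("*")
    return translated[1:] == protein.rstrip("*")[1:]


print(translate("ATGAAAGTTTAA"))               # MKV*
print(matches_protein("ATGAAAGTTTAA", "MKV"))  # True
print(matches_protein("TTAAACTTTCAT", "MKV"))  # False (wrong strand)
```

If the check fails for the sequence as sliced but passes for its reverse complement, the strand is the problem; if it passes only after shifting the slice by one base, the coordinate interpretation is.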

I would like to ask whether I am interpreting the ID-to-ID correspondence incorrectly, or whether I am interpreting the start and end positions incorrectly. Are the start and end positions inclusive or exclusive?

Thank you so much for your help.

