Improving conversion of abricate tsv file to gff3 file

Since Steve provided such a neat solution (abricate tsv to gff3), here are a few further steps that I would like to add so that the script matures into something usable by many others.

I have two files: (1) a FASTA file with the .fna extension, and (2) a data file with the .tsv extension generated by abricate.

I need to write a gff3 file (containing the sequence) by combining these two files (.fna and .tsv).

A .gff3 file generally has three or four components as below. Details here.

  1. Declaration of file type and version
  2. Declaration of sequence region(s): the name of the sequence, a single space, the start coordinate (1), a single space, and the length of the FASTA sequence (taken from the .fna file).
  3. Sequence features
  4. Fasta sequence (optional)

A truncated example is shown below:

##gff-version 3

##sequence-region NZ_BHEI01000001 1 13890

NZ_BHEI01000001 Prodigal CDS 29 1588 . + 0 ID=CEMHHHDG_00001;product=hypothetical protein

NZ_BHEI01000001 Prodigal CDS 1585 2289 . + 0 ID=CEMHHHDG_00002;product=Putative beta-lactamase HcpC

….

##FASTA

>NZ_BHEI01000001

AAAAAAAGAATGCTTGCCTACTGGAGTAGTGCCTGTTACTTTTAAACCCACTTTTTTGCG

TTATGCGGATGAGTATTTTTTAAGAGTTGAATTTCAAGATGGAAGTGATGAAATCACTCA

TATAGAGGAGTTGGCAAAATACACAGATCAAA……….

Notice that the sequence name has to match in every instance (the declaration of sequence regions, the sequence features, and the FASTA header at the end).
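The name-matching requirement can be checked mechanically on a finished file. Below is a minimal sketch, assuming the layout shown in the example above (a `##sequence-region` line per sequence, feature lines whose first column is the sequence name, and an optional `##FASTA` section). The function name `check_names` is hypothetical and not part of abricate or of Steve's script.

```shell
# check_names FILE.gff3   (hypothetical helper)
# Returns 0 if every feature line and every FASTA header uses a name
# that was declared on a ##sequence-region line; otherwise prints the
# offending names and returns 1.
check_names() {
    awk '
        /^##sequence-region / { declared[$2] = 1; next }   # remember declared names
        /^##FASTA/            { in_fasta = 1; next }       # sequence section starts here
        /^#/ || /^$/          { next }                     # skip other directives and blank lines
        in_fasta {
            if (/^>/) {
                n = substr($1, 2)                          # header name without ">"
                if (!(n in declared)) { print "undeclared FASTA header: " n; bad = 1 }
            }
            next
        }
        {
            # Feature line: column 1 is the sequence name.
            if (!($1 in declared)) { print "undeclared feature seqid: " $1; bad = 1 }
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Calling `check_names example.gff3` prints nothing and succeeds when all three places agree.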

The .tsv to .gff3 conversion was elegantly solved by Steve.

Now we need to add the declaration of sequence region(s) to the .gff3 file. The declaration consists of:

“##sequence-region”[space]the input file name without the file extension .fna[space]sequence start[space]sequence length[\n]

For the last part, I have used the following awk code on the .fna file to generate the start and length in the desired format. It assumes one sequence per file; with several records it would report only the last one.

awk '/^>/ { seqlen = 0; next } { seqlen += length($0) } END { print "1 " seqlen }' example.fna

This output needs to be appended to the sequence-region declaration line.

Now we need to append the FASTA sequence in the following format:
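One way the whole declaration line might be produced in a single step is sketched below. It assumes a single-record FASTA file and, as described above, takes the sequence name from the file name with the .fna extension removed. The function name `sequence_region` is hypothetical.

```shell
# sequence_region FILE.fna   (hypothetical helper)
# Prints "##sequence-region <name> 1 <length>" for a single-record FASTA file.
sequence_region() {
    fna=$1
    # Sequence name = file name without directory and without the .fna extension.
    name=$(basename "$fna" .fna)
    awk -v name="$name" '
        /^>/ { next }                               # skip the FASTA header line
        { sub(/\r$/, ""); seqlen += length($0) }    # drop a trailing CR, then count residues
        END { print "##sequence-region " name " 1 " (seqlen + 0) }
    ' "$fna"
}
```

For example, `sequence_region example.fna` prints the declaration for example.fna, ready to be written directly below the `##gff-version 3` line.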

##FASTA

>Sequence name as in the parts above in the .gff3 file

AAAAAAAGAATGCTTGCCTACTGGAGTAGTGCCTGTTACTTTTAAACCCACTTTTTTGCG

TTATGCGGATGAGTATTTTTTAAGAGTTGAATTTCAAGATGGAAGTGATGAAATCACTCA

TATAGAGGAGTTGGCAAAATACACAGATCAAA……….
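A minimal sketch of how this section could be generated from the .fna file follows. It assumes the sequence name is the file name without the .fna extension, so the original FASTA header is replaced with that name to keep it consistent with the rest of the .gff3 file. The function name `fasta_section` is hypothetical.

```shell
# fasta_section FILE.fna   (hypothetical helper)
# Prints the "##FASTA" directive followed by the sequence, with the
# header rewritten to the file-derived sequence name.
fasta_section() {
    fna=$1
    name=$(basename "$fna" .fna)
    echo "##FASTA"
    awk -v name="$name" '
        /^>/ { print ">" name; next }    # replace the header so the names match
        { sub(/\r$/, ""); print }        # copy sequence lines, dropping any trailing CR
    ' "$fna"
}
```

Running `fasta_section example.fna >> example.gff3` would append the section to an existing example.gff3 that already holds the header and feature lines.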

Is there any help in sight, please?
