Convert Abricate output tsv file to gff3 format

Here’s one way using awk that I think fulfills the requirements. It stores the index of each column name (taken from the header on the first line) in an array, so that fields can be referenced by name rather than by position. This isn’t strictly necessary, but in my opinion it makes for a more readable solution. With the following in a file called tsv2gff3.awk:

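# Use tabs for both input and output; c numbers the features.
BEGIN {
    FS=OFS="\t"
    c=1
}

# Header line: strip the leading "#", then map each column name to its index.
NR==1 && sub(/^#/, "") {
    for(i=1; i<=NF; i++) {
        a[$i]=i
    }

    print "##gff-version 3"
    next
}

# Data lines: tidy two fields, then print the nine GFF3 columns.
{
    # Drop the .fna extension from the file name.
    sub(/\.fna$/, "", $a["FILE"])
    # Remove parentheses and square brackets from the product description.
    gsub(/[\(\)\[\]]/, "", $a["PRODUCT"])

    id = "ID=" $a["FILE"] "_" c++
    product = "product=" $a["PRODUCT"]

    # Columns: seqid, source, type, start, end, score, strand, phase, attributes.
    print \
        $a["FILE"],
        $a["DATABASE"],
        "CDS",
        $a["START"],
        $a["END"],
        ".",
        "+",
        "0",
        id ";" product
}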
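
To see the header lookup on its own, here is a minimal sketch that prints two columns by name. The function name print_gene_product and the array name col are made up for illustration; only the GENE and PRODUCT column names come from the Abricate header.

```shell
# Hypothetical helper: print the GENE and PRODUCT columns by name,
# whatever position they occupy in the tab-separated input.
print_gene_product() {
    awk '
        BEGIN { FS = OFS = "\t" }
        NR == 1 {
            sub(/^#/, "")                          # drop the leading "#" from the header
            for (i = 1; i <= NF; i++) col[$i] = i  # column name -> field index
            next
        }
        { print $col["GENE"], $col["PRODUCT"] }
    ' "$@"
}
```

It can be called as print_gene_product file.tsv, or fed input on a pipe. If the columns were ever reordered, the same script should keep working, which is the main appeal of looking fields up by name.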

Given the following input in file.tsv:

#FILE   SEQUENCE    START   END GENE    COVERAGE    COVERAGE_MAP    GAPS    %COVERAGE   %IDENTITY   DATABASE    ACCESSION   PRODUCT
UBird_Cyou_D3.fna   BJCZ01000001.1  1866608 1867417 cdtB    1-810/810   =============== 0/0 100 90  vfdb    CAD48850    (cdtB) cytolethal distending toxin B [CDT (VF0185)] [Escherichia coli O157:H str. 493/89]
UBird_Cyou_D3.fna   BJCZ01000001.1  1867414 1868190 cdtA    1-777/777   =============== 0/0 100 90.61   vfdb    CAD48849    (cdtA) cytolethal distending toxin A [CDT (VF0185)] [Escherichia coli O157:H str. 493/89]
UBird_Cyou_D3.fna   BJCZ01000001.1  2245186 2246238 ompA    1-1041/1041 ========/====== 1/12    100 94.11   vfdb    AAF37887    (ompA) outer membrane protein A [OmpA (VF0236)] [Escherichia coli O18:K1:H7 str. RS218]

Running awk -f tsv2gff3.awk file.tsv produces:

##gff-version 3
UBird_Cyou_D3   vfdb    CDS 1866608 1867417 .   +   0   ID=UBird_Cyou_D3_1;product=cdtB cytolethal distending toxin B CDT VF0185 Escherichia coli O157:H str. 493/89
UBird_Cyou_D3   vfdb    CDS 1867414 1868190 .   +   0   ID=UBird_Cyou_D3_2;product=cdtA cytolethal distending toxin A CDT VF0185 Escherichia coli O157:H str. 493/89
UBird_Cyou_D3   vfdb    CDS 2245186 2246238 .   +   0   ID=UBird_Cyou_D3_3;product=ompA outer membrane protein A OmpA VF0236 Escherichia coli O18:K1:H7 str. RS218
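
As a quick sanity check on output like the above, the sketch below counts data lines that do not have exactly nine tab-separated fields, which is the number of columns a GFF3 feature line has. The function name count_bad_gff3_lines is hypothetical, and the check only looks at the column count, not at the content of each field.

```shell
# Hypothetical helper: read GFF3 text on standard input and print how many
# data lines do not have exactly nine tab-separated fields.
count_bad_gff3_lines() {
    awk -F '\t' '
        /^#/ || NF == 0 { next }   # skip directives, comments and blank lines
        NF != 9 { bad++ }          # a feature line should have nine columns
        END { print bad + 0 }      # prints 0 when every line passed
    '
}
```

For example, awk -f tsv2gff3.awk file.tsv | count_bad_gff3_lines should print 0 if every row was converted with all nine columns present.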
