Determining LOC coordinate from GFF3 start column

Hi all, total noob question:

I have a GFF3 file of a pepper (C. annuum) plant genome that looks like this:

seqid   src     type    start   end     score   strand  phase   attributes
chr01   PROTEIN gene    29119   37617   .       -       .       ID=CA.PGAv.1.6.scaffold567.122
chr01   PROTEIN mRNA    29119   37617   .       -       .       ID=TC.CA.PGAv.1.6.scaffold567.122;Parent=CA.PGAv.1.6.scaffold567.122
chr01   PROTEIN exon    29119   29457   .       -       0       Parent=TC.CA.PGAv.1.6.scaffold567.122
chr02   ABINITI gene    157637  159805  0.22    -       .       ID=CA.PGAv.1.6.scaffold1545.2
chr04   ISGAP   gene    11689   14256   1096    +       .       ID=CA.PGAv.1.6.scaffold638.93

I am trying to cross-reference the features in the GFF3 with the genes from this paper, which identifies the locations with numbers such as “LOC107867643”, “LOC107868281”, etc. I’m assuming these are the absolute coordinates in their aligned sequence.

I’m assuming the “start” column is relative to the start of its seqid, both because chr04, for example, has a start smaller than chr02’s and going by the spec.
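A minimal sketch of how that assumption could be checked in Python: it collects the smallest start and largest end per seqid from a few of the rows above. The function name `seqid_ranges` is made up for illustration, and the sample is split on whitespace only because the pasted excerpt lost its tabs (a real GFF3 file is tab-separated).

```python
# A few rows copied from the excerpt above.
SAMPLE = """\
chr01 PROTEIN gene 29119 37617 . - . ID=CA.PGAv.1.6.scaffold567.122
chr01 PROTEIN exon 29119 29457 . - 0 Parent=TC.CA.PGAv.1.6.scaffold567.122
chr02 ABINITI gene 157637 159805 0.22 - . ID=CA.PGAv.1.6.scaffold1545.2
chr04 ISGAP gene 11689 14256 1096 + . ID=CA.PGAv.1.6.scaffold638.93
"""


def seqid_ranges(text):
    """Return {seqid: (min_start, max_end)} over all feature lines."""
    ranges = {}
    for line in text.splitlines():
        # Skip blank lines and GFF3 comment/pragma lines.
        if not line.strip() or line.startswith("#"):
            continue
        # At most 9 fields; the attributes column is kept whole.
        fields = line.split(None, 8)
        seqid, start, end = fields[0], int(fields[3]), int(fields[4])
        lo, hi = ranges.get(seqid, (start, end))
        ranges[seqid] = (min(lo, start), max(hi, end))
    return ranges


ranges = seqid_ranges(SAMPLE)
for seqid, (lo, hi) in sorted(ranges.items()):
    print(seqid, lo, hi)
```

If the ranges of different seqids overlap or restart from small values, as chr04 does relative to chr02 here, that is consistent with coordinates being counted separately on each seqid.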

My question is: how, then, do I translate the chr02 start 157637, for example, to an absolute coordinate that I can match up with the LOC numbers published in the paper?

For example, if the last feature for chr01 has an “end” of 309042759 and the first feature for chr02 has a “start” of 157637, can I just do 309042759 + 157637 = 309200396 to get the whole-genome coordinate for that feature?
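To make the proposed arithmetic concrete, here is a rough Python sketch. The helper name `concatenated_position` is hypothetical, and the chr01 "length" is only the end of its last feature, as in the example above. That value is a lower bound on the real chromosome length, which would normally come from the assembly FASTA or from `##sequence-region` lines if the file had them. Whether a coordinate built this way corresponds to the paper's LOC numbers is a separate question that this sketch does not settle.

```python
def concatenated_position(seqid, pos, order, lengths):
    """Add the lengths of all sequences listed before `seqid` to `pos`."""
    offset = 0
    for name in order:
        if name == seqid:
            return offset + pos
        offset += lengths[name]
    raise KeyError(f"{seqid} is not in the sequence order")


# Assumed chromosome order for the concatenation.
order = ["chr01", "chr02"]
# Stand-in length: the "end" of the last chr01 feature, not the true length.
lengths = {"chr01": 309042759}

whole_genome = concatenated_position("chr02", 157637, order, lengths)
print(whole_genome)  # 309200396
```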

I found this Biostars question, which noted that if the chromosome itself were listed in the file it would start at 1, but I do not have any such entries in this file.

Any help would be great, thanks!


