Answer: PopGenome – VCF, fasta, GTF and codons still missing

Dear Maciek

Hopefully you were able to solve these problems already.

I cannot comment on the main set of issues you reported. However, I also encountered the error: `Error in START[!REV, 3] : incorrect number of dimensions` following certain instances of `set.synnonsyn` which I also noticed occurred for genes which have only a single CDS listed in the gtf file.
There is a ‘dirty’ way to fix this which is to just ‘copy and paste’ any lone CDS in the gtf file, thereby adding a dummy CDS. As each dummy is a duplicate and therefore covers the same coordinates, it should not affect the number of coding SNPs detected.
Of course, this is not an elegant solution, but it enabled `set.synnonsyn` proceed without error for single-CDS genes.

First I read in the gtf and make a list of all ENSEMBL IDs of genes which have only one CDS:

library(ape)
library(dplyr)

gtf <- paste(“stickleback_v5_ensembl_genes_reformatted.gtf”)
allgenes <- read.gff(gtf, na.strings = c(“.”, “?”), GFF3 = FALSE)
allgenes<-subset(allgenes,feature==”CDS”)[9]
allgenes$attributes<-gsub(“ENS_gene=”,””,allgenes$attributes)
allgenes$attributes<-gsub(“;.*”,””,allgenes$attributes)
colnames(allgenes)[1]<-“ensembl_gene_id”
allgenes <- add_count(allgenes,ensembl_gene_id,name=”n_CDS”)
singles <- subset(allgenes,n_CDS==1)[1]

Then I get a subset of the gtf containing only these single CDS genes:

allgenes <- read.gff(gtf, na.strings = c(“.”, “?”), GFF …

Source link