Extracting exons and transcripts from gff3/gtf

I was just doing something similar about a week ago.
You may be able to accomplish this using the GenomicFeatures R package.

First load up the following in R:

library(GenomicFeatures)
library(GenomicRanges)
library(rtracklayer)

Then you will need a chromosome sizes file, which you can generate by following the directions in this post: Get chromosome sizes from fasta file. Basically, you need the FASTA file of the genome, and then you use samtools to produce the chrominfo/chrom sizes file.

Then read that file into R with:

# tab-separated, two columns: sequence name and length
chrominfo <- read.table(file = "your/file/path/sizes.genome", sep = "\t")
colnames(chrominfo) <- c("chrom", "length")
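
If you would rather not leave R, the same two-column table can probably be built from the genome FASTA directly. This is a minimal sketch, assuming the Rsamtools package is installed. The helper name chrom_sizes_from_fasta and the toy FASTA are made up for illustration, and I have not run this on the urchin genome:

```r
library(Rsamtools)

# Hypothetical helper: index a FASTA file and return a chrominfo-style table.
chrom_sizes_from_fasta <- function(fasta_path) {
  indexFa(fasta_path)             # writes <fasta_path>.fai, like `samtools faidx`
  fai <- scanFaIndex(fasta_path)  # GRanges with one range per sequence
  data.frame(chrom  = as.character(seqnames(fai)),
             length = width(fai),
             stringsAsFactors = FALSE)
}

# Toy FASTA standing in for the real genome file
toy_fasta <- tempfile(fileext = ".fa")
writeLines(c(">chr1", "ACGTACGT", ">chr2", "ACG"), toy_fasta)
chrominfo_toy <- chrom_sizes_from_fasta(toy_fasta)
```

For the real genome you would call the helper on your FASTA path and pass the result as chrominfo.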

Then you should be able to plug it into GenomicFeatures as follows. You might have to download the GTF file from the NCBI link you provided instead, because I think makeTxDbFromGFF only supports the GFF3 and GTF file formats:

purple.urchin.txdb <- GenomicFeatures::makeTxDbFromGFF(organism = "Strongylocentrotus purpuratus", 
                                              format = "gtf",
                                              file =  "~/your/path/here/GCF_000002235.5_Spur_5.0_genomic.gtf", 
                                              chrominfo = chrominfo)

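Then you can export the exons in BED format with the code below. I am unsure whether this meets your criterion (1), one record for each unique, non-overlapping exon, because exonsBy returns the annotated exons as they are and does not merge overlapping ones: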

exons <- exonsBy(purple.urchin.txdb, by = c("gene"))
exons <- unlist(exons)
rtracklayer::export(exons,'/your/file/path/here/exons.bed')
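
One possible way to get closer to criterion (1) is to merge overlapping exons with reduce() from GenomicRanges. This is only a sketch: the helper name nonoverlapping_exons and the toy ranges are made up for illustration, and whether "merged" is what you mean by "unique, non-overlapping" is an assumption on my part:

```r
library(GenomicRanges)

# Hypothetical helper: collapse overlapping (and directly adjacent) exons
# into a single range each, per chromosome and strand.
nonoverlapping_exons <- function(ex) {
  reduce(ex, ignore.strand = FALSE)
}

# Toy input standing in for exons(purple.urchin.txdb)
toy_exons <- GRanges(seqnames = "chr1",
                     ranges   = IRanges(start = c(1, 5, 30), end = c(10, 20, 40)),
                     strand   = "+")
merged <- nonoverlapping_exons(toy_exons)
```

With the real TxDb this would be nonoverlapping_exons(exons(purple.urchin.txdb)), and the result can be passed to rtracklayer::export as above. Note that reduce() also fuses exons that touch without overlapping; disjoin() is the alternative if you want overlapping exons split into pieces instead of merged.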

As for (2), one record for the longest transcript of each protein-coding gene, the following exports every transcript of every gene, so it would still need filtering:

transcripts <- transcriptsBy(purple.urchin.txdb, by = "gene")
transcripts <- unlist(transcripts)
rtracklayer::export(transcripts,'your/file/path/here/transcripts.bed')
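
To narrow this down to the longest transcript per gene, one option is to start from the table that transcriptLengths() in GenomicFeatures returns (columns tx_id, tx_name, gene_id, nexon, tx_len, plus cds_len when with.cds_len = TRUE). The sketch below uses base R on a toy table of that shape. The helper name longest_coding_tx is made up, and treating "has a CDS" as "protein-coding" and "longest" as the largest spliced length are assumptions:

```r
# Hypothetical helper: keep the longest transcript of each gene that has a CDS.
longest_coding_tx <- function(tx_lens) {
  # Assumption: a transcript counts as protein-coding if its CDS length is > 0
  coding <- tx_lens[!is.na(tx_lens$gene_id) & tx_lens$cds_len > 0, ]
  # Sort so that the longest transcript of each gene comes first
  coding <- coding[order(coding$gene_id, -coding$tx_len), ]
  # Keep the first (longest) row per gene
  coding[!duplicated(coding$gene_id), ]
}

# Toy table standing in for transcriptLengths(txdb, with.cds_len = TRUE)
toy_lens <- data.frame(
  tx_id   = 1:4,
  tx_name = c("t1", "t2", "t3", "t4"),
  gene_id = c("g1", "g1", "g2", "g3"),
  tx_len  = c(1000, 1500, 800, 600),
  cds_len = c(300, 900, 0, 450),
  stringsAsFactors = FALSE
)
longest <- longest_coding_tx(toy_lens)
```

With the real TxDb, the input would come from transcriptLengths(purple.urchin.txdb, with.cds_len = TRUE), and the chosen transcripts can be pulled out of transcripts(purple.urchin.txdb) by matching on tx_id before exporting.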

Maybe someone else can answer or comment with details on how to meet your exact criteria, but this is a start that you could play around with.

I do have to note that when I tried to make a TxDb object for mouse from the Gencode vM27 GTF file, I don’t think I obtained all the elements compared with the TxDb object obtained from Ensembl via makeTxDbFromEnsembl.

With that said, there may be a way to make a TxDb from Ensembl directly. It may be something like this (the server and port here are guesses, and port has to be a number rather than a string):

purple.urchin.txdb <- makeTxDbFromEnsembl(organism = "Strongylocentrotus purpuratus", server = "ensembldb.ensembl.org", username = "anonymous", port = 3337L)

and then you could continue with the exons <- and transcripts <- steps above.

I do see that Ensembl has the data for this species, under Ensembl Metazoa, so the main ensembldb.ensembl.org server may not be the right one. I am just not sure exactly how to pass it to makeTxDbFromEnsembl:
metazoa.ensembl.org/Strongylocentrotus_purpuratus/Info/Index?db=core

This page lists the public MySQL servers and might help you find the correct server address and port: useast.ensembl.org/info/data/mysql.html
