Easy Way To Get 3′ UTR Lengths Of A List Of Genes
Hi, as the title says, I’m wondering whether there is a tool that would let me drop in a list of, say, Entrez gene IDs and get their corresponding 3′ UTR lengths?
Thanks for any suggestions.
As an alternative web-based solution, try the Ensembl BioMart. It gives you 3′ UTR information for each transcript of a gene, but requires a little subtraction, e.g.:
- Choose Database -> Ensembl Genes 60
- Choose Dataset -> Homo Sapiens Genes
- Click “Filters” on left hand menubar
- Expand “Gene” section by clicking “+”
- Select “ID list limit” check box.
- Select Entrez gene IDs from ID list limit drop down menu
- Paste in list of Entrez gene IDs
- Click “Attributes” on left hand menubar
- Click “Sequences” radio button
- Expand “Sequences” section by clicking “+”
- Check “3′ UTR” under “sequences” header
- Expand “Header information” section by clicking “+”
- Check “3′ UTR start” and “3′ UTR end” and “Transcript name” under “Transcript information” header
- Click “Results” button at top left.
This will give you a set of FASTA records, one 3′ UTR per transcript for your Entrez gene IDs, each with the start and end of the 3′ UTR in genome coordinates in its header. I believe this solution has the same problem of not accounting for introns in 3′ UTRs, but because of the gene<->transcript<->UTR mapping, it will account for alternative 3′ UTRs.
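If the subtraction is a nuisance, one option is to count the bases in each exported FASTA record instead. The sketch below is only an illustration: the function name and the header layout in the example are made up, and it assumes that transcripts without a 3′ UTR come back with the placeholder line "Sequence unavailable". Whether the sequence length matches end minus start depends on how the export handles introns, which is worth checking on a transcript you know.

```python
def utr_lengths_from_fasta(fasta_text):
    """Return {header: number of bases} for each record in FASTA text."""
    lengths = {}
    header = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # New record: remember its header and start counting from zero.
            header = line[1:]
            lengths[header] = 0
        elif header is not None and line != "Sequence unavailable":
            # Assumed placeholder for transcripts with no 3' UTR is skipped.
            lengths[header] += len(line)
    return lengths


# Hypothetical export; the IDs and header fields are invented for illustration.
example = """>GENE_A|TX_A1|100|109
ACGTACGTAC
>GENE_A|TX_A2|100|115
ACGTACGT
ACGTACGT
>GENE_B|TX_B1||
Sequence unavailable
"""

lengths = utr_lengths_from_fasta(example)
for header, n_bases in lengths.items():
    print(header, n_bases)
```

Keeping the whole header as the key preserves the gene<->transcript mapping, so alternative 3′ UTRs of the same gene stay on separate rows.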
This can be done using the GenomicFeatures library from Bioconductor (and dplyr). I will use the RefSeq transcripts (“refGene”) from mouse (“mm10”):
library(GenomicFeatures)
library(dplyr)
# Build a transcript database from the UCSC refGene track for mm10
refSeq <- makeTxDbFromUCSC(genome="mm10", tablename="refGene")
# 3' UTR exon ranges grouped by transcript, named by RefSeq ID
threeUTRs <- threeUTRsByTranscript(refSeq, use.names=TRUE)
length_threeUTRs <- width(ranges(threeUTRs))
the_lengths <- as.data.frame(length_threeUTRs)
# Sum the UTR exon widths within each transcript, so introns are excluded
the_lengths <- the_lengths %>% group_by(group, group_name) %>% summarise(sum(value))
the_lengths <- unique(the_lengths[,c("group_name", "sum(value)")])
colnames(the_lengths) <- c("RefSeq Transcript", "3' UTR Length")
The dataframe “the_lengths” has what you need.
The table knownGene in the UCSC database contains all the information you want about the structure of the gene (the positions of the introns, exons, cdsStart/cdsEnd, txStart/txEnd). The table kgXref contains the NCBI ID and is linked to knownGene.
For the genes on the ‘+’ strand the query would be as follows (for speed, I won’t take into account any splicing between the last codon and the end of transcription; that would need more code than a simple SQL query):
mysql -h genome-mysql.cse.ucsc.edu -A -u genome -D hg18
mysql> select distinct X.geneSymbol, K.txEnd-K.cdsEnd
from
kgXref as X,
knownGene as K
where
X.geneSymbol!=""
and K.name=X.kgId and
K.strand="+" ;
+---------------+------------------+
| geneSymbol | K.txEnd-K.cdsEnd |
+---------------+------------------+
| BC032353 | 3006 |
| AX748260 | 3157 |
| BC048429 | 1540 |
| OR4F5 | 0 |
| OR4F5 | 1 |
| DQ599874 | 31 |
| DQ599768 | 78 |
(..)
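The "more code" needed to handle splicing and the ‘-’ strand could look roughly like the sketch below. It works on the knownGene columns strand, cdsStart, cdsEnd, exonStarts and exonEnds, and assumes the usual UCSC conventions: 0-based half-open coordinates, comma-separated exon lists with a trailing comma, and cdsStart equal to cdsEnd for non-coding transcripts. The function name and the example coordinates are made up for illustration.

```python
def three_prime_utr_length(strand, cds_start, cds_end, exon_starts, exon_ends):
    """Spliced 3' UTR length from knownGene-style fields.

    Returns None for transcripts with no coding region (cds_start == cds_end).
    """
    if cds_start == cds_end:
        return None
    # Exon lists look like "100,300,500," so empty pieces are dropped.
    starts = [int(x) for x in exon_starts.split(",") if x]
    ends = [int(x) for x in exon_ends.split(",") if x]
    total = 0
    for start, end in zip(starts, ends):
        if strand == "+":
            # 3' UTR lies downstream of the CDS: keep exon bases >= cds_end.
            lo, hi = max(start, cds_end), end
        else:
            # On the minus strand the 3' UTR lies before cds_start.
            lo, hi = start, min(end, cds_start)
        if hi > lo:
            total += hi - lo
    return total


# Invented transcript with three exons: [100,200), [300,400), [500,600).
plus = three_prime_utr_length("+", 150, 350, "100,300,500,", "200,400,600,")
minus = three_prime_utr_length("-", 350, 550, "100,300,500,", "200,400,600,")
noncoding = three_prime_utr_length("+", 100, 100, "100,", "200,")
print(plus, minus, noncoding)
```

For the ‘+’ strand example, txEnd-cdsEnd would give 600-350 = 250, while the spliced length is 150, because the 100 bp intron between the last two exons is not counted.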
Hmmm, there is not typically alternative splicing of 3′-UTRs, but it can happen. There certainly are lots of examples of alternate terminal exons. So, I would not want to link gene symbol to 3′-UTR length, but rather gene symbol to mRNA identifier to its 3′-UTR length. Perhaps Pierre’s table above shows that for gene OR4F5, but a length of zero is not a good test for one gene with 2 mRNA isoforms and hence two 3′-UTR lengths that may or may not differ.
Just something to consider…
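To make the gene symbol -> mRNA identifier -> 3′-UTR length idea concrete, here is a minimal sketch that keeps one length per transcript and then flags genes whose isoforms disagree. The rows, gene names and transcript IDs are hypothetical; in practice they would come from a query that also selects the transcript identifier.

```python
from collections import defaultdict


def utr_lengths_by_gene(rows):
    """Group (gene, transcript, utr_length) rows as {gene: {transcript: length}}."""
    by_gene = defaultdict(dict)
    for gene, transcript, length in rows:
        by_gene[gene][transcript] = length
    return dict(by_gene)


# Hypothetical rows: (gene symbol, transcript ID, 3' UTR length).
rows = [
    ("GENE_A", "tx1", 0),
    ("GENE_A", "tx2", 1),
    ("GENE_B", "tx3", 1540),
    ("GENE_B", "tx4", 1540),
]

table = utr_lengths_by_gene(rows)
# Genes whose transcripts do not all share one 3' UTR length.
multi = {gene: tx for gene, tx in table.items() if len(set(tx.values())) > 1}
print(multi)
```

Keeping the transcript level means two isoforms with identical lengths still show up as two entries, rather than collapsing into one row as "select distinct" on gene symbol and length would.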