Easy Way To Get 3′ UTR Lengths Of A List Of Genes

Hi, as the title says, I’m wondering if there is any tool available that would allow me to drop in a list of, say, Entrez gene IDs and get their corresponding 3′ UTR lengths?

Thanks for any suggestions.


As an alternative web-based solution, Ensembl BioMart will give you 3′ UTR information for each transcript of a gene, although getting lengths from the coordinates requires a little subtraction. For example:

  1. Choose Database -> Ensembl Genes 60
  2. Choose Dataset -> Homo Sapiens Genes
  3. Click “Filters” on left hand menubar
  4. Expand “Gene” section by clicking “+”
  5. Select “ID list limit” check box.
  6. Select Entrez gene IDs from ID list limit drop down menu
  7. Paste in list of Entrez gene IDs
  8. Click “Attributes” on left hand menubar
  9. Click “Sequences” radio button
  10. Expand “Sequences” section by clicking “+”
  11. Check “3′ UTR” under “sequences” header
  12. Expand “Header information” section by clicking “+”
  13. Check “3′ UTR start” and “3′ UTR end” and “Transcript name” under “Transcript information” header
  14. Click “Results” button at top left.

This will give you FASTA records of the 3′ UTRs for all transcripts of your set of Entrez gene IDs, with the start and end of each 3′ UTR in genome coordinates in the headers. I believe subtracting start from end shares the problem of not accounting for introns in 3′ UTRs, but because of the gene<->transcript<->UTR mapping, this approach will account for alternative 3′ UTRs.
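If the exported sequences themselves are used, the subtraction can be skipped by counting bases per FASTA record. The following is a minimal sketch, not part of BioMart: the function name, the example header layout, and the assumption that records without a UTR contain the text "Sequence unavailable" are illustrative and should be checked against your own export.

```python
def utr_lengths_from_fasta(fasta_text):
    """Return {header: number of bases} for each FASTA record.

    Records whose body is the placeholder text are mapped to None
    (assumption: that is how a missing 3' UTR shows up in the export).
    """
    lengths = {}
    header = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            lengths[header] = 0
        elif header is not None:
            if line == "Sequence unavailable":
                lengths[header] = None
            elif lengths[header] is not None:
                # Sequences may be wrapped over several lines, so accumulate
                lengths[header] += len(line)
    return lengths


# Hypothetical records; real headers depend on the attributes you ticked
example = """>TX_A|100|105
ACGTAC
GT
>TX_B||
Sequence unavailable
"""
print(utr_lengths_from_fasta(example))
# {'TX_A|100|105': 8, 'TX_B||': None}
```

Because the result is keyed by the full header, each transcript keeps its own length, which preserves the gene<->transcript<->UTR mapping.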

This can be done using the GenomicFeatures library from Bioconductor (and dplyr).

I will use the RefSeq transcripts (“refGene”) from mouse (“mm10”):

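library(GenomicFeatures)
library(dplyr)

# Build a transcript database from the UCSC refGene table (needs internet access)
refSeq             <- makeTxDbFromUCSC(genome="mm10", tablename="refGene")
# 3' UTR exon ranges grouped by transcript, named by RefSeq transcript ID
threeUTRs          <- threeUTRsByTranscript(refSeq, use.names=TRUE)
# Width of every 3' UTR exon; as.data.frame gives columns group, group_name, value
length_threeUTRs   <- width(ranges(threeUTRs))
the_lengths        <- as.data.frame(length_threeUTRs)
# Sum exon widths per transcript, so introns inside the UTR are not counted
the_lengths        <- the_lengths %>% group_by(group, group_name) %>% summarise(sum(value))
the_lengths        <- unique(the_lengths[,c("group_name", "sum(value)")])
colnames(the_lengths) <- c("RefSeq Transcript", "3' UTR Length")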

The dataframe “the_lengths” has what you need.

The table knownGene in the UCSC database contains all the information you want about the structure of the gene (the positions of the introns and exons, cdsStart/cdsEnd, txStart/txEnd).

The table kgXref contains the NCBI ID and is linked to knownGene.

For the genes on the ‘+’ strand the query would be as follows (for speed, I won’t take into account any splicing between the last codon and the end of the transcript; that would need more code than a simple SQL query):

mysql  -h  genome-mysql.cse.ucsc.edu -A -u genome -D hg18
mysql>select distinct X.geneSymbol, K.txEnd-K.cdsEnd
 from
 kgXref as X,
 knownGene as K
 where
   X.geneSymbol!=""
   and K.name=X.kgId and
   K.strand="+" ;

+---------------+------------------+
| geneSymbol    | K.txEnd-K.cdsEnd |
+---------------+------------------+
| BC032353      |             3006 |
| AX748260      |             3157 |
| BC048429      |             1540 |
| OR4F5         |                0 |
| OR4F5         |                1 |
| DQ599874      |               31 |
| DQ599768      |               78 |
(..)
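The splicing-aware version that the query skips can be sketched outside SQL. The function below is an illustration, not part of the UCSC tools: it assumes the knownGene columns strand, cdsStart, cdsEnd, exonStarts and exonEnds have been fetched as plain values, that the exon lists are comma-separated strings with a trailing comma, that coordinates are 0-based half-open, and that cdsStart equal to cdsEnd marks a transcript without an annotated CDS. The function and parameter names are hypothetical.

```python
def three_prime_utr_length(strand, cds_start, cds_end, exon_starts, exon_ends):
    """Count exonic bases downstream of the CDS for one knownGene-style row.

    Returns None when cds_start == cds_end (assumed: no CDS annotated).
    """
    if cds_start == cds_end:
        return None
    starts = [int(s) for s in exon_starts.split(",") if s]
    ends = [int(e) for e in exon_ends.split(",") if e]
    total = 0
    for start, end in zip(starts, ends):
        if strand == "+":
            # Portion of this exon lying at or after cdsEnd
            total += max(0, end - max(start, cds_end))
        else:
            # On '-', the 3' UTR is the exonic part before cdsStart
            total += max(0, min(end, cds_start) - start)
    return total


# Made-up transcript with three exons and a CDS from 150 to 350
print(three_prime_utr_length("+", 150, 350, "100,300,500,", "200,400,600,"))  # 150
print(three_prime_utr_length("-", 150, 350, "100,300,500,", "200,400,600,"))  # 50
```

For the '+' example, txEnd-cdsEnd would report 250, while the exonic 3′ UTR is only 150 bases because the 100-base intron between the last two exons is excluded.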

Hmmm, alternative splicing within 3′-UTRs is not typical, but it can happen. There certainly are lots of examples of alternative terminal exons. So, I would not want to link gene symbol directly to 3′-UTR length, but rather gene symbol to mRNA identifier to its 3′-UTR length. Perhaps Pierre’s table above shows that for gene OR4F5, but a length of zero is not a good test case for one gene with two mRNA isoforms and hence two 3′-UTR lengths, which may or may not differ.

Just something to consider…
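One way to keep that gene to mRNA to 3′-UTR length chain is a nested mapping rather than a single number per gene. This is only a sketch with made-up gene symbols, transcript identifiers and lengths; the function name is hypothetical.

```python
from collections import defaultdict


def utr_lengths_by_gene(rows):
    """Group (gene, transcript, utr_length) rows as {gene: {transcript: length}}."""
    by_gene = defaultdict(dict)
    for gene, transcript, length in rows:
        by_gene[gene][transcript] = length
    return {gene: dict(transcripts) for gene, transcripts in by_gene.items()}


# Made-up rows: GENE_A has two isoforms with different 3' UTR lengths
rows = [
    ("GENE_A", "tx1", 120),
    ("GENE_A", "tx2", 450),
    ("GENE_B", "tx3", 80),
]
print(utr_lengths_by_gene(rows))
# {'GENE_A': {'tx1': 120, 'tx2': 450}, 'GENE_B': {'tx3': 80}}
```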

