UCSC knownCanonical hg19 vs. hg38


We have an FAQ page that covers this topic (genome.ucsc.edu/FAQ/FAQgenes.html#singledownload). As posted by ATpoint, it boils down to different datasets and different approaches.

The hg19 knownCanonical table was last updated in 2013 and was built primarily from RefSeq and GenBank sequences, plus a few other sources. One isoform, typically the longest, was selected for each gene (as defined by UCSC IDs).
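
To make the "one isoform per gene, typically the longest" idea concrete, here is a toy sketch in Python. It is not the actual UCSC build pipeline: it assumes the selection is purely by length, and the `Transcript` record, the `longest_per_gene` helper, and all IDs are hypothetical names made up for illustration.

```python
from collections import namedtuple

# Hypothetical record type; the real tables have many more fields.
Transcript = namedtuple("Transcript", ["gene_id", "transcript_id", "length"])


def longest_per_gene(transcripts):
    """Return {gene_id: Transcript}, keeping the longest isoform of each gene.

    Simplifying assumption: length is the only criterion. On a tie, the
    isoform seen first is kept.
    """
    best = {}
    for tx in transcripts:
        current = best.get(tx.gene_id)
        if current is None or tx.length > current.length:
            best[tx.gene_id] = tx
    return best


# Made-up example data: two genes, three isoforms.
example = [
    Transcript("geneA", "txA.1", 1200),
    Transcript("geneA", "txA.2", 3400),
    Transcript("geneB", "txB.1", 900),
]
canonical = longest_per_gene(example)
```

With this input, `canonical` holds `txA.2` for `geneA` and `txB.1` for `geneB`, so a gene with several isoforms contributes exactly one row.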

For hg38, knownCanonical was last built on the GENCODE v36 models earlier this year. In this case, one canonical isoform was chosen per Ensembl gene ID. The hierarchy used to choose that isoform is described in the FAQ page.

So while the tables have the same name, they originate from different data and different designations of a gene.

A more recent and ‘standardized’ approach is to use NCBI’s RefSeq Select transcripts (www.ncbi.nlm.nih.gov/refseq/refseq_select/). These are NCBI’s pick of a single representative transcript for every protein-coding gene. If you compare the number of RefSeq Select transcripts in hg19 and hg38, you’ll see the counts are very similar:

#Assembly #tableName #count
hg19 ncbiRefSeqSelect 21461
hg38 ncbiRefSeqSelect 21763
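
If you want to reproduce counts like these yourself, one option is the UCSC public MySQL server. The sketch below is only one way to do it: it assumes the `mysql` command-line client is installed, that you have network access, and that the ncbiRefSeqSelect table is available on the public server for both assemblies. The helper names `count_command` and `count_rows` are hypothetical.

```python
import subprocess

# Connection details for the public, read-only UCSC MySQL server.
UCSC_HOST = "genome-mysql.soe.ucsc.edu"
UCSC_USER = "genome"


def count_command(assembly, table="ncbiRefSeqSelect"):
    """Build the mysql client command that counts the rows of `table`.

    `assembly` is the database name, e.g. "hg19" or "hg38".
    """
    query = f"SELECT COUNT(*) FROM {table}"
    return [
        "mysql",
        f"--user={UCSC_USER}",
        f"--host={UCSC_HOST}",
        "-A",  # skip auto-rehash for a faster connection
        "-N",  # do not print the column header
        "-B",  # batch (tab-separated) output
        "-e", query,
        assembly,
    ]


def count_rows(assembly, table="ncbiRefSeqSelect"):
    """Run the query and return the count (needs mysql client and network)."""
    result = subprocess.run(
        count_command(assembly, table),
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())
```

Calling `count_rows("hg19")` and `count_rows("hg38")` should give numbers comparable to the table above, although the exact values may differ if the tables have been updated since this was posted.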

Sometime in the near future the MANE project (www.ncbi.nlm.nih.gov/refseq/refseq_select/#MANE) will release a list of canonical transcripts for hg38 that are standardized between RefSeq and GENCODE.

If you have any follow-up questions, our public help desk can always be reached at genome@soe.ucsc.edu. You may also send questions to genome-www@soe.ucsc.edu if they contain sensitive data. For any Genome Browser questions on Biostars, the UCSC tag is the best way to ensure visibility by the team.
