Alternate nucleotide is more frequent than reference nucleotide. OMG I’m dizzy. How do I stop the twirl?

This happens because the very reference genomes that we use for re-alignment are themselves built from individuals who carry rare risk alleles. Thus, when we call variants against these genomes, we are, at many loci, comparing against a rare disease risk allele rather than the common allele.

As the best/worst example (depending on your point of view), hg19 / GRCh37 was used for more than a decade as the primary reference genome, yet ~70% of its sequence was derived from a single individual from the Buffalo area, New York, USA. Among the thousands of rare disease susceptibility alleles that this individual carried was one called Factor V Leiden, which significantly increases the risk of deep vein thrombosis (DVT). If you’re researching DVT (I was), you have to be aware of this.

Thus, if I perform exome-seq on an individual who does not have Factor V Leiden and re-align the data to hg19 / GRCh37, the Factor V Leiden position will be called as an SNV. The allele in my patient sample (the common one, which doesn’t increase risk of DVT) is being compared against the disease allele contained in the very reference genome against which I’m re-aligning my data. Without careful screening, I may erroneously assume that my patient has increased risk of DVT.
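As a rough illustration of the kind of screening I mean, here is a minimal Python sketch that flags VCF records where the ALT allele is the common one, i.e. where the reference is probably carrying the minor allele. It assumes that your VCF has already been annotated with a population allele frequency stored under an INFO key called `AF`. That key name, the 0.5 cut-off, the function name, and the example records (made-up coordinates on a placeholder contig) are all assumptions of mine, so adjust them to your own annotation. Note that in many VCFs `AF` is computed from the samples in the file itself, not from a population database.

```python
def flag_reference_minor_alleles(vcf_lines, af_key="AF", threshold=0.5):
    """Yield (chrom, pos, ref, alt, af) for each ALT allele whose
    annotated frequency exceeds `threshold`.

    Assumes `af_key` holds one frequency per ALT allele, comma-separated.
    """
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]

        # Parse INFO into a dict; flag-style entries get an empty value.
        info_map = {}
        for entry in info.split(";"):
            key, _sep, value = entry.partition("=")
            info_map[key] = value

        raw_af = info_map.get(af_key)
        if not raw_af:
            continue  # no frequency annotation on this record

        for alt_allele, af_text in zip(alt.split(","), raw_af.split(",")):
            try:
                af = float(af_text)
            except ValueError:
                continue  # missing value such as "."
            if af > threshold:
                yield chrom, int(pos), ref, alt_allele, af


# Made-up records for illustration only (not real coordinates).
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chrA\t100\t.\tT\tC\t50\tPASS\tAF=0.98",
    "chrA\t200\t.\tG\tA\t50\tPASS\tAF=0.01",
    "chrA\t300\t.\tA\tG,T\t50\tPASS\tDP=30;AF=0.2,0.7",
]

flagged = list(flag_reference_minor_alleles(example))
for record in flagged:
    print(record)
```

Anything that this flags is a position where most samples will show the ALT allele, often as a homozygous call, so it deserves a second look before you interpret it as a patient-specific finding.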

There was a publication on this listed in PubMed but it’s very difficult to find, even with Google. It’s a critical problem, yet it has not received the attention that it deserves.

Edit June 2, 2021: much later, I found it: “The Reference Human Genome Demonstrates High Risk of Type 1 Diabetes and Other Disorders”

The situation improved with hg38 / GRCh38, as this reference build was based on many more individuals, but the same problems still persist, broadly speaking.

So, you really have to get to know your target panel and all of these nuances related to whatever variants you’re studying, particularly if you’re dealing with live patient data.

Kevin



Update 3rd January 2018

It has come to my attention that there is an automated method to search for these types of variants in your VCF:
