The Evolution from HG19 to HG38

Welcome to another blog post!

Reference genomes are essential benchmarks of a species’ genome: they make it possible to compare individual genomes accurately and are crucial tools for identifying genetic variants and diagnosing rare diseases. Here, we explore the evolution of the human reference genome, focusing on the transition from HG19 to HG38, its advantages and limitations, and the implications of migrating for clinical diagnostics and genomic research. We also look at how tools like AION are facilitating this transition for improved diagnostic accuracy.

Understanding Reference Genomes and Their Significance for Genomics Research and Clinical Diagnostics

A reference genome is a nucleic acid sequence database assembled as an accepted representative example of a species’ complete genome, based on the most up-to-date information available. Reference genomes are essential in genomics research, where they are used for studying genetic traits and diversity, and in clinical diagnostics, where they act as a universal standard for comparing genomes and diagnosing genetic diseases1. Consequently, it is imperative that reference genomes are as complete and representative as possible; however, most eukaryotic reference genomes do not have complete end-to-end sequence information2. In fact, the latest official Genome Reference Consortium (GRC)-released version of the human reference genome has hundreds of gaps, may not accurately reflect genetic diversity, and focuses on the euchromatic fraction of the genome1,2. Nevertheless, it is still accepted and widely used while efforts continue to construct, validate, and analyse a truly complete human reference genome3.

The first human genome was drafted following the huge international effort of the Human Genome Project and has been continually improved in line with ongoing research over the past two decades3,4. The human reference genome is curated by the GRC, the organisation responsible for ensuring it is revised regularly in line with new findings5. The GRC releases a major update (a new version) every few years, in addition to more regular, minor updates known as patches. The latest GRC-released version of the human genome is Genome Reference Consortium Human Build 38 (GRCh38)6, commonly referred to as HG38 or Build 38. HG38 is an upgraded version of the previous build, GRCh37, commonly known as HG19. In 2022, a more up-to-date version of the human reference genome was released by the Telomere-to-Telomere (T2T) Consortium3. This reference genome, T2T-CHM13, is the first gapless version of the human genome. Although T2T-CHM13 has not yet been formally adopted as the primary human reference genome, it represents a major contribution to genomics and is anticipated to significantly enhance the current primary reference genome.

The Transition from HG19 to HG38: Advancements and Limitations

HG19, released in 2009, is an earlier version of the human reference genome. It has been widely used in genomics projects, including long-term, global studies like the 1000 Genomes Project7, as well as in clinical diagnostics. HG38 is a significantly upgraded version released in 2013, assembled using Sanger sequencing data from many donors8. HG38 has been employed for major genomics projects such as the 100,000 Genomes Project9,10 because it offers several improvements over its predecessor11,12:

  • Updated Assembly: HG38 provides a more complete and accurate representation of the human genome, with fewer gaps and numerous sequence updates. HG38 altered 8,000 nucleotides and expanded coverage from ~92.5% in HG19 to approximately 95% of the human genome8,13. This enhanced accuracy is pivotal for clinical diagnostics, given the importance of precision in making accurate diagnoses and developing effective treatment plans. Accuracy is likely to improve further as clinicians move towards adopting the gapless T2T-CHM13 reference genome in the coming years3.

  • Alternate Haplotypes: In HG19, only three regions had alternate locus sequences, nine in total. Upon release, HG38 included 178 regions containing 261 alternate locus sequences, substantially improving how well the reference reflects genetic diversity and allowing more accurate diagnoses across diverse populations14,15.

  • Structural Variants: HG38 provides a better representation of structural variants and complex genomic regions, many of which were underrepresented in HG19. One study reported a 26.8% decrease in structural variants identified with HG38 compared to HG19, which its authors hypothesised was due to a reduction in the high rate of false positives associated with structural variant detection8,16.

  • Decoy Sequences: Integrating decoy sequences helps reduce false positives, improving the accuracy of genomic analyses. HG38 saw the inclusion of the hs38d1 decoy, which contains human genomic sequences that could not be placed on chromosomes when the reference genome was assembled, as well as the Epstein-Barr virus sequence17,18. Including the EBV genome is important to capture endogenous EBV that infects B cells in ~90% of the population and artefacts stemming from the immortalisation of human lymphocytes17.
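To make the alternate-locus and decoy points above more concrete, here is a minimal Python sketch that sorts contig names from an HG38 reference into categories. It assumes the UCSC-style naming used in commonly distributed HG38 analysis sets (an `_alt` suffix for alternate loci, a `_decoy` suffix for hs38d1 decoys, and `chrEBV` for the Epstein-Barr virus sequence). Naming can differ between distributions, so treat the patterns, the function names, and the example contig list as illustrative assumptions rather than a definitive rule set.

```python
import re
from collections import Counter

# Assumption: UCSC-style names (chr1..chr22, chrX, chrY, chrM) for primary contigs.
PRIMARY_PATTERN = re.compile(r"^chr([1-9]|1[0-9]|2[0-2]|X|Y|M)$")


def classify_contig(name: str) -> str:
    """Return a coarse category for a contig name (hypothetical helper)."""
    if PRIMARY_PATTERN.match(name):
        return "primary"
    if name.endswith("_alt"):
        return "alt"      # alternate locus sequence
    if name.endswith("_decoy"):
        return "decoy"    # hs38d1 decoy sequence
    if name == "chrEBV":
        return "ebv"      # Epstein-Barr virus sequence
    return "other"        # e.g. unlocalised or unplaced scaffolds


def summarise_contigs(names):
    """Count how many contigs fall into each category."""
    return Counter(classify_contig(n) for n in names)


# Illustrative contig names only; a real reference lists thousands of contigs.
example_contigs = [
    "chr1",
    "chrX",
    "chr6_GL000250v2_alt",
    "chrUn_KN707606v1_decoy",
    "chrEBV",
    "chr1_KI270706v1_random",
]

if __name__ == "__main__":
    for category, count in sorted(summarise_contigs(example_contigs).items()):
        print(category, count)
```

A summary like this can be a quick sanity check that an analysis pipeline is actually using a reference that includes the alternate loci and decoys it expects.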

The transition from HG19 to HG38 has significantly impacted genomic research, offering a more robust framework for understanding human genetics and its variations, which is likely to be improved even further with the adoption of T2T-CHM13. Moreover, the enhanced accuracy and representation of diversity provided with HG38 have considerable implications for rare disease diagnostics and precision medicine19,20.

While HG38 demonstrates significant improvements in representation and accuracy, many scientists across research and clinical laboratories continue to use HG19 as their reference genome of choice11. Staying on HG19 carries an increased risk of incorrect variant interpretation, which can be especially detrimental in diagnostic settings. However, it allows labs to retain consistency throughout long-term projects and avoids altering existing analysis pipelines, since labs should avoid using two reference genomes at the same time8. Moreover, much of the existing literature reports HG19 coordinates, and initially, the lack of annotation tools for HG38 represented a considerable bottleneck8. Migrating to HG38 also presents complications regarding the realignment and conversion of previously generated data, which is technically complex and time- and resource-intensive19.

It is likely that, with the increasing availability of advanced, accurate annotation and variant interpretation tools that support HG38 and streamline the transition, more researchers will be willing and able to make the switch. This will not only yield more accurate genomic data and faster diagnosis and treatment but also improve consistency, avoid confusion, and reduce errors.

Choosing Between HG19 and HG38 With AION

Researchers should carefully consider the version of the reference genome used in their work, as this can significantly impact the interpretation of genomic data, particularly in rare disease clinical diagnostics. Many bioinformatics tools support one or the other, but the flexibility to switch between the HG19 and HG38 reference genomes should not be underestimated, given the technical and practical benefits of each. 

AION is an AI-driven rare disease variant interpretation platform that supports the accelerated diagnosis of rare diseases using a machine-learning model trained on high-quality genetic variant data points. AION has been clinically validated on the Genomics England 100,000 Genomes Project, showing high sensitivity in identifying causative variants. AION has the advantage of allowing users the flexibility to choose between the HG19 and HG38 reference genomes simply by selecting one or the other when running a case. This capacity accommodates the nuances between different versions of the human reference genome, ensuring accurate variant interpretation and genomic analysis across different builds for each user’s unique project requirements. 

From a practical perspective, transitioning from HG19 to HG38 requires a lab to modify and validate its existing workflows for secondary and tertiary analysis, which can be burdensome in terms of time and costs. Moreover, lift-over, while possible, is usually imperfect, and variants may be detected in one genome but not the other. A 2021 study21 reported that only 7% of surveyed laboratories (including academic and clinical diagnostic labs) had transitioned to HG38. Those still using HG19 cited time and monetary costs and a lack of staff to support the migration as the main reasons for not yet migrating. AION is helping labs overcome these hurdles, allowing them to efficiently migrate to HG38 for more accurate, reliable clinical diagnostics. The lab just needs to adopt a secondary analysis strategy for HG38, and AION has the rest covered. In addition, if you need technical support transitioning to HG38, our expert team is here to help.
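To illustrate why lift-over is imperfect, here is a toy Python sketch of block-based coordinate conversion. It assumes a simplified model in which the old and new builds are related by a sorted list of ungapped aligned blocks on the same chromosome and strand. The `Block` type, the `lift_position` function, and all coordinates are hypothetical and made up for illustration. Established tools such as UCSC liftOver, CrossMap, or Picard LiftoverVcf work from real chain files and also handle strand changes and reference allele differences, which this sketch does not.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional, Sequence


class Block(NamedTuple):
    """One ungapped aligned block between two builds (hypothetical model)."""
    src_start: int  # 0-based inclusive start on the old build
    src_end: int    # 0-based exclusive end on the old build
    dst_start: int  # start of the matching block on the new build


def lift_position(pos: int, blocks: Sequence[Block]) -> Optional[int]:
    """Map a position from the old build to the new one.

    `blocks` must be sorted by src_start and must not overlap.
    Returns None when the position falls outside every aligned block,
    which is the case where a variant cannot be lifted over.
    """
    starts = [b.src_start for b in blocks]
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0:
        return None
    block = blocks[i]
    if pos >= block.src_end:
        return None  # position lies in a gap between aligned blocks
    return block.dst_start + (pos - block.src_start)


# Made-up blocks: the region 1000-1199 of the old build has no counterpart.
TOY_BLOCKS = [
    Block(0, 1000, 0),
    Block(1200, 2000, 1050),
    Block(2000, 3000, 2500),
]

if __name__ == "__main__":
    for old_pos in (500, 1100, 1250, 2000):
        print(old_pos, "->", lift_position(old_pos, TOY_BLOCKS))
```

In this toy model, any variant at positions 1000 to 1199 simply has no new coordinate. This is one reason realigning reads to HG38 is generally more reliable than converting existing HG19 calls.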

Conclusions and Future Perspectives

In conclusion, the evolution of the human reference genome from HG19 to HG38 marks a significant advance in genomics, offering enhanced accuracy, better representation of genetic diversity, and incorporation of decoy sequences. While HG38 is the more up-to-date and comprehensive version, HG19 continues to be heavily used due to barriers to transitioning, and similar barriers are likely to slow the adoption of future reference genomes. Advanced tools like AION are working to overcome these barriers by facilitating the transition from HG19 to HG38 and providing expert support for labs looking to migrate.

Moving forward, advanced long-read sequencing techniques will likely continue to facilitate the development of highly accurate and complete genome sequencing capabilities22, leading to more accurate, representative reference genomes3 and representing a huge step towards the more inclusive diagnosis of rare diseases. 

To learn more about how Nostos Genomics and our AI-driven variant interpretation platform, AION, can support your lab’s transition to HG38 for more accurate research and clinical diagnostics, book a free demo with one of our genomics experts. 

References

1. Wong KHY, Ma W, Wei CY, et al. Towards a reference genome that captures global genetic diversity. Nat Commun. 2020;11(1):5482. doi:10.1038/s41467-020-19311-w

2. A reference standard for genome biology. Nat Biotechnol. 2018;36(12):1121. doi:10.1038/nbt.4318

3. Nurk S, Koren S, Rhie A, et al. The complete sequence of a human genome. Science. 2022;376(6588):44-53. doi:10.1126/science.abj6987

4. International Human Genome Sequencing Consortium; Lander ES, Linton LM, Birren B, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409(6822):860-921. doi:10.1038/35057062

5. Genome Reference Consortium. Accessed January 24, 2024. www.ncbi.nlm.nih.gov/grc

6. Schneider VA, Graves-Lindsay T, Howe K, et al. Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly. Genome Res. 2017;27(5):849-864. doi:10.1101/gr.213611.116

7. The 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491(7422):56-65. doi:10.1038/nature11632

8. Guo Y, Dai Y, Yu H, Zhao S, Samuels DC, Shyr Y. Improvements and impacts of GRCh38 human reference on high throughput sequencing data analysis. Genomics. 2017;109(2):83-90. doi:10.1016/j.ygeno.2017.01.005

9. 100,000 Genomes Project. Genomics England. Accessed January 24, 2024. www.genomicsengland.co.uk/initiatives/100000-genomes-project

10. Sosinsky A, Ambrose J, Cross W, et al. Insights for precision oncology from the integration of genomic and clinical data of 13,880 tumors from the 100,000 Genomes Cancer Programme. Nat Med. 2024;30(1):279-289. doi:10.1038/s41591-023-02682-0

11. Li H, Dawood M, Khayat MM, et al. Exome variant discrepancies due to reference-genome differences. Am J Hum Genet. 2021;108(7):1239-1250. doi:10.1016/j.ajhg.2021.05.011

12. Lowy-Gallego E, Fairley S, Zheng-Bradley X, et al. Variant calling on the GRCh38 assembly with the data from phase three of the 1000 Genomes Project. Wellcome Open Res. 2019;4:50. doi:10.12688/wellcomeopenres.15126.2

13. Zhao T, Duan Z, Genchev GZ, Lu H. Closing Human Reference Genome Gaps: Identifying and Characterizing Gap-Closing Sequences. G3 Genes Genomes Genetics. 2020;10(8):2801-2809. doi:10.1534/g3.120.401280

14. Church DM, Schneider VA, Steinberg KM, et al. Extending reference assembly models. Genome Biol. 2015;16(1):13. doi:10.1186/s13059-015-0587-3

15. Jäger M, Schubach M, Zemojtel T, Reinert K, Church DM, Robinson PN. Alternate-locus aware variant calling in whole genome sequencing. Genome Med. 2016;8(1):130. doi:10.1186/s13073-016-0383-z

16. Abel HJ, Duncavage EJ. Detection of structural DNA variation from next generation sequencing data: a review of informatic approaches. Cancer Genet. 2013;206(12):432-440. doi:10.1016/j.cancergen.2013.11.002

17. Santpere G, Darre F, Blanco S, et al. Genome-Wide Analysis of Wild-Type Epstein–Barr Virus Genomes Derived from Healthy Individuals of the 1000 Genomes Project. Genome Biol Evol. 2014;6(4):846-860. doi:10.1093/gbe/evu054

18. Genovese G, Handsaker RE, Li H, Kenny EE, McCarroll SA. Mapping the Human Reference Genome’s Missing Sequence by Three-Way Admixture in Latino Genomes. Am J Hum Genet. 2013;93(3):411-421. doi:10.1016/j.ajhg.2013.07.002

19. Pan B, Kusko R, Xiao W, et al. Similarities and differences between variants called with human reference genome HG19 or HG38. BMC Bioinformatics. 2019;20(S2):101. doi:10.1186/s12859-019-2620-0

20. Pagnamenta AT, Camps C, Giacopuzzi E, et al. Structural and non-coding variants increase the diagnostic yield of clinical whole genome sequencing for rare diseases. Genome Med. 2023;15(1):94. doi:10.1186/s13073-023-01240-0

21. Lansdon LA, Cadieux-Dion M, Yoo B, et al. Factors Affecting Migration to GRCh38 in Laboratories Performing Clinical Next-Generation Sequencing. J Mol Diagn. 2021;23(5):651-657. doi:10.1016/j.jmoldx.2021.02.003

22. Method of the Year 2022: long-read sequencing. Nat Methods. 2023;20(1):1. doi:10.1038/s41592-022-01759-x
