A draft human pangenome reference

Sample selection

We identified parent–child trios from the 1KG in which the child cell line banked within the NHGRI Sample Repository for Human Genetic Research at the Coriell Institute for Medical Research was listed as having zero expansions and two or fewer passages, and rank-ordered representative individuals as follows. Loci with minor allele frequencies (MAFs) below 0.05 were removed. MAFs were measured in the full cohort (that is, 2,504 individuals, 26 subpopulations) regardless of each individual’s subpopulation labelling. For each chromosome, principal component analysis (PCA) was performed for dimension reduction. This resulted in a matrix with 2,200 features, which was then centred and scaled using smartPCA normalization. The matrix was further reduced to 100 features through another round of PCA.

We defined the representative individuals of a subpopulation as those who are similar to the other members in the group (which, in this scenario, is the subpopulation they belong to), as well as different from individuals outside the group. Group is defined by previous 1KG population labels (for example, ‘Gambian in Western Division’). We did this as follows. For each sample, we first calculated the intragroup distance d_intra, which is the average of L2-norms between the sample and samples of the same subpopulation. The intergroup distance, d_inter, was similarly defined as the average of L2-norms between the sample and samples from all other subpopulations. The L2-norms were derived in the feature space of the PCA. The score of this sample was then defined as 10 × d_intra + d_inter/(n − 1), where n is the number of subpopulations. For each subpopulation, if fewer than three trios were available, all were selected. Otherwise, trios were sorted by ranking children with max(paternal_rank, maternal_rank), where paternal_rank and maternal_rank are the respective ranks of each parent’s score, selecting the three trios with a maximum value. We ranked by parent scores because during the year 1 effort, the child samples did not have sequencing data and therefore had to be represented by the parents.
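For illustration, this scoring can be sketched in Python as follows; the variable names (features for the 100-feature PCA matrix and subpop for the subpopulation labels) are placeholders rather than the code used for the actual selection.

import numpy as np

def representativeness_scores(features, subpop):
    # features: samples x 100 PCA matrix; subpop: subpopulation label per sample
    n = len(set(subpop))  # number of subpopulations
    scores = []
    for i, x in enumerate(features):
        same = [j for j, g in enumerate(subpop) if g == subpop[i] and j != i]
        other = [j for j, g in enumerate(subpop) if g != subpop[i]]
        # intragroup distance: average L2-norm to members of the same subpopulation
        d_intra = np.mean(np.linalg.norm(features[same] - x, axis=1))
        # intergroup distance: average L2-norm to members of all other subpopulations
        d_inter = np.mean(np.linalg.norm(features[other] - x, axis=1))
        # score as defined in the text
        scores.append(10 * d_intra + d_inter / (n - 1))
    return scores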

Ideally, we would have selected the same number of candidates from each subpopulation, with an equal number of candidates of each sex. To correct for imbalances, we applied the following criteria for each subpopulation’s candidate set: (1) when the sex was unbalanced (that is, off by more than one sample), we tried to swap in the next-best candidate of the less represented sex or did nothing if this was not possible; (2) if a subpopulation had fewer individuals than the desired sample selection size (that is, all candidates were selected), its unused slots were distributed to other unsaturated subpopulations. The latter choice is arbitrary but should have little impact on the overall results.

The genetic information used in this study was derived from publicly available cell lines from the NHGRI Sample Repository for Human Genetic Research and the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research. Therefore, this study is exempt from human research approval as the proposed work involved the collection or study of data or specimens that are already publicly available.

Sequencing

Cell line expansion and banking

Lymphoblastoid cell lines (LCLs) used for sequencing from the 1KG collection (Supplementary Table 1) were obtained from the NHGRI Sample Repository for Human Genetic Research at the Coriell Institute for Medical Research. HG002 (GM24385) and HG005 (GM24631) LCLs were obtained from the NIGMS Human Genetic Cell Repository at the Coriell Institute for Medical Research. All expansions for sequencing were derived from the original expansion culture lot to ensure the lowest possible number of passages and to reduce overall culturing time. Cells used for HiFi, Nanopore, Omni-C, Strand-seq, 10x Genomics and Bionano production and for g-banded karyotyping and Illumina Omni2.5 microarray were expanded to a total culture size of 4 × 108 cells, which resulted in a total of five passages after cell line establishment. Cells were split into production-specific sized vials as follows: HiFi, 2 × 107 cells; Nanopore, 5 × 107 cells; Omni-C, 5 × 106 cells; Strand-seq, 1 × 107 cells; 10x Genomics, 4 × 106 cells; and Bionano, 4 × 106 cells. Cells for Strand-seq were stored in 65% RPMI-1640, 30% FBS and 5% DMSO and frozen as viable cultures. All other cells were washed in PBS and flash-frozen as dry cell pellets. Cells used for ONT-UL production were separately expanded from the original expansion culture lot to a bank of five vials of 5 × 106 cells. A single vial was subsequently expanded to a total culture size of 4 × 108 cells, which resulted in a total of eight passages. Cells were also reserved for g-banded karyotyping and Illumina Omni2.5 microarray.

Karyotyping and microarray

G-banded karyotype analysis was performed on 5 × 106 cells collected at passage five (for HiFi, Nanopore and Omni-C) and passage eight (for ONT-UL). For all cell lines, 20 metaphase cells were counted, and a minimum of 5 metaphase cells were analysed and karyotyped. Chromosome analysis was performed at a resolution of 400 bands or greater. A pass/fail criterion was used before cell lines proceeded to sequencing. Cell lines with normal karyotypes (46,XX or 46,XY) or lines with benign polymorphisms that are frequently seen in apparently healthy individuals were classified as passes. Cell lines were classified as failures if two or more cells harboured the same chromosomal abnormality. DNA used for microarray was isolated from frozen cell pellets (3 × 106 to 7 × 106 cells) using a Maxwell RSC Cultured Cells DNA kit on a Maxwell RSC 48 instrument (Promega). DNA was genotyped at the Children’s Hospital of Philadelphia’s Center for Applied Genomics using an Infinium Omni2.5-8 v.1.3 BeadChip (Illumina) on an iScan System instrument (Illumina).

HiFi sequencing

PacBio HiFi sequencing was distributed between two centres: Washington University in St. Louis and the University of Washington. We describe the protocols used at each centre separately.

Washington University in St. Louis

High-molecular-weight DNA was isolated from frozen cell pellets using a Qiagen MagAttract HMW DNA kit and sheared using a Diagenode Megaruptor I to 20 kb mode size. At all steps, DNA quantity was checked on a Qubit Fluorometer I with a dsDNA HS Assay kit (Thermo Fisher), and sizes were examined on a FEMTO Pulse (Agilent Technologies) using a Genomic DNA 165 kb kit. SMRTbell libraries were prepared for sequencing according to the protocol ‘Procedure & Checklist—Preparing HiFi SMRTbell Libraries using the SMRTbell Express Template Prep Kit 2.0’. After SMRTbell generation, material was size-selected on a SageELF system (Sage Science) using the ‘0.75% 1-18 kb’ program (target 3,450 bp in well 12), and some combinations of fraction 3 (average size of 15–21 kb), fraction 2 (average size of 16–27 kb) and fraction 1 (average size of 20–31 kb) were selected for sequencing, depending on the empirical size measurements and available mass. The selected library fractions were bound with Sequencing Primer v.2 and Sequel II Polymerase v.2.0 and sequenced on Sequel II instruments (PacBio) on SMRT Cells 8M using Sequencing Plate v.2.0, diffusion loading, 2 h of pre-extension and 30 h of movie times. Samples were sequenced to a minimum HiFi data amount of 108.5 Gbp (35× estimated genome coverage) on four SMRT Cells.

University of Washington

High-molecular-weight DNA was isolated from frozen cell pellets using a modified Gentra Puregene method and sheared using gTUBE (Covaris) to 20 kb mode size. At all steps, DNA quantity was checked by fluorometry on a DS-11 FX instrument (DeNovix) with a Qubit dsDNA HS Assay kit (Thermo Fisher), and sizes were examined on a FEMTO Pulse (Agilent Technologies) using a Genomic DNA 165 kb kit. SMRTbell libraries were prepared for sequencing according to the protocol ‘Procedure & Checklist—Preparing HiFi SMRTbell Libraries using the SMRTbell Express Template Prep Kit 2.0’. After SMRTbell generation, material was size-selected on a SageELF system (Sage Science) using the ‘0.75% 1–18 kb’ program (target 3,400 bp in well 12), and fraction 2 (average size of 17–20 kb) or fraction 1 (average size of 18–20 kb) was selected for sequencing, depending on the empirical size measurements and available mass. For some samples, the SageELF program ‘0.75% agarose, 10 kb–40 kb’ (target 10,000 bp in well 10) was used, and fractions 6 and 7 were pooled together for sequencing (average size of 17–21 kb). The selected library fractions were bound with Sequencing Primer v.2 and Sequel II Polymerase v.2.0 and sequenced on Sequel II instruments (PacBio) on SMRT Cells 8M using Sequencing Plate v.2.0, diffusion loading, 3–4 h of pre-extension and 30 h of movie times. Samples were sequenced to a minimum HiFi data amount of 96 Gbp (30× estimated genome coverage) on at least four SMRT Cells.

Comparisons of HiFi production methods

Although subtle differences in HiFi data production methods existed between the University of Washington and Washington University in St. Louis, the resulting data were highly similar, with overlapping assembly statistics from most samples. These initial genomes were sequenced at a time when methods were being refined and optimized for HiFi sequencing, as it was a relatively new process. The primary differences in protocols are part of the nucleic acid isolation, fragmentation and size selection, with the downstream sequencing-specific applications being more consistent. Both teams were closely engaged with each other as well as with our company associates, including New England Biolabs (NEB), Qiagen, Diagenode and Sage Science, to provide optimal end products.

Nanopore ultra-long sequencing protocol

For the 18 additional samples, we used the nanopore unsheared long-read sequencing protocol16. This generated about 60× coverage of unsheared sequencing from 3 PromethION flow cells and an N50 value of around 44 kb. For the 29 newly selected HPRC samples (Results), we used the protocol outlined below.

DNA extraction

Around 50 million cells in a pellet were resuspended in 200 µl of PBS, and the resuspended cells were aliquoted (40 µl) into five 1.5 ml DNA Lo-bind Eppendorf tubes. The following procedure for DNA extraction was completed for each of the five aliquots. Each tube contained sufficient DNA for three libraries loaded onto one flow cell. The following reagents were added in sequence to each tube with pipette mixing (10 times up and down) using a P200 wide-bore pipette: 40 μl of proteinase K, 40 μl of buffer CS and 40 μl of CLE3. The samples were then incubated at room temperature (18–25 °C) for 30 min. Next, 40 μl of RNase A was added to each tube with pipette mixing (10 times) with a P200 wide-bore pipette, and samples were incubated at room temperature for 3 min. Two hundred microlitres of BL3 was mixed with 200 μl PBS in a 1.5 ml Eppendorf tube. Four hundred microlitres of this BL3–PBS mixture was then added to each sample and the samples mixed 10 times with a P1000 wide-bore pipette set to 600 μl.

Samples were incubated for 10 min at room temperature and then pipette mixed 5 times, then incubated at room temperature for 10 min and pipette mixed 5 times and then further incubated for 10 min at room temperature. A white precipitate may form after addition of BL3. This is normal. A Nanobind disk was added to the cell lysate, then 600 μl of isopropanol was added. Mixing was performed by inversion of the tube 5 times. Tubes were further mixed on a tube rotator (9 r.p.m. at room temperature for 10 min). The tubes were then placed on a magnetic tube rack, and the Nanobind disk positioned closer to the top of the tube to avoid inadvertent removal of the DNA bound to the Nanobind disk. The supernatant was discarded using a pipette and 700 μl of buffer CW1 was added to each tube. The tube in the magnetic rack was then inverted 4 times for mixing. A second and third wash with 500 μl of buffer CW2 (inversion mix 4 times for each wash) was performed. After the second CW2 wash, liquid was removed from the tube cap and the tubes spun on a mini-centrifuge for 2 s, and replaced on the magnetic rack. Residual liquid was removed from the bottom of the tube, taking care not to remove DNA associated with the Nanobind disk. Elution from the Nanobind disk was accomplished by adding 160 μl Circulomics elution buffer (EB) plus 0.02% Triton X-100 (comprising 316.8 μl EB and 3.2 μl 2% Triton X-100) and incubating at room temperature for at least 1 h. Tubes were gently tapped halfway through elution. DNA was collected by transferring eluate with a P200 wide-bore pipette to a new 1.5 ml microcentrifuge tube. Some liquid and DNA remained on the Nanobind disk after pipetting. The tube containing the Nanobind disk was spun in a centrifuge at 10,000g for 5 s, and any additional liquid that came off the disk was transferred to the eluate tube. This process was repeated if necessary until all DNA was removed. The samples were pipette mixed 5 times (approximately 10 s to aspirate and 10 s to dispense for each cycle) with a wide-bore P200 pipette to homogenize the sample. Samples were further allowed to rest at room temperature overnight to allow DNA to solubilize (disperse).

Library preparation

DNA tagmentation and FRA

Circulomics EB+ (EB buffer with 0.02% Triton X-100) was prepared, and 140.82 μl EB+ was aliquoted into a 1.5 ml Eppendorf DNA Lo-Bind tube. UHMW DNA (300 μl) from above was aliquoted into the same tube with a wide-bore P200 pipette. The mixture was slowly pipetted up and down 3 times with a wide-bore P200 pipette set to 150 µl. In a separate 1.5 ml Eppendorf DNA Lo-Bind tube, the following reagents were added in sequence: 144 μl of FRA dilution buffer, 9.18 μl of 1 M MgCl2 and 6 μl FRA. The tube was tapped to mix and spun down using a microcentrifuge. The EB–Triton X-100–DNA mixture was added to the FRA dilution buffer–MgCl2–FRA mixture with a wide-bore P200 pipette. This mixture was then pipette mixed 15–20 times with a wide-bore P1000 pipette set to 600 µl. The mixture appeared homogeneous when pipette mixing was finished. The tube was then incubated for 15 min at room temperature. The mixture was then pipette mixed 5 times with a wide-bore P1000 pipette set to 600 µl and incubated at room temperature for an additional 15 min. The mixture was incubated at 30 °C for 1 min, followed by 80 °C for 1 min and then held at 4 °C.

FRA clean-up

Clean-up used a Nanobind disk. A 5 mm Nanobind disk was added to the above-described reaction mixture followed by 300 µl of Circulomics buffer NAF10. The tube was gently tapped 10–20 times to mix. The mixture was placed on a platform rocker at 20 r.p.m. for 2 min at room temperature. A DNA ‘cloud’ was visible on the Nanobind disk. The tube was spun for 1–2 s using a benchtop microcentrifuge and placed on a magnetic rack. The binding solution was removed and discarded. The Nanobind disk was washed by adding 350 µl ONT long fragment buffer (LFB) and gently tapped 5 times to mix. The tube was spun for 1–2 s using a microcentrifuge and placed on a magnetic rack. The ONT LFB was removed and discarded. Care was taken to not pipette DNA attached to the Nanobind disk. This LFB wash was repeated. The tube was then briefly spun (microcentrifuge) to move the Nanobind disk to the bottom of the tube. DNA was eluted from the Nanobind disk by adding 125 µl of ONT EB to the tube. The tube was incubated for 30 min at room temperature then gently tapped 5 times (mixing) and incubated for an additional 30 min at room temperature. Fluid was slowly aspirated 4 times over the Nanobind disk before removing the eluate from the tube. The eluate was transferred to a new 1.5 ml Eppendorf DNA Lo-Bind tube using a wide-bore P200 pipette. The eluate was then pipette mixed 2 times with a wide-bore P200 pipette.

Adapter attachment and rapid adaptor

Rapid adaptor (RAP) was added to the DNA preparation. To 120 µl of eluate (from above), 3 µl of ONT RAP was added. The mixture was pipette mixed 8 times with a wide-bore P200 pipette. The mixture was then incubated for 15 min at room temperature and then again pipette mixed 8 times with a wide-bore P200 pipette.

RAP reaction clean-up with Nanobind

The final library clean-up step removes unligated adaptor. In brief, 120 µl Circulomics EB was added to 123 µl of the above-described RAP reaction mixture. The mixture was slowly pipette mixed 3 times with a wide-bore P1000 pipette set to 240 µl. Each aspiration took about 10 s and each dispense took around 10 s. A 5 mm Nanobind disk was added to the reaction mixture followed by 120 µl Circulomics buffer NAF10. Mixing was accomplished by gentle tapping. The tube was incubated for 5 min at room temperature without agitation or rotation. The tube was gently tapped 5 times (2–3 gentle taps each time) during the 5 min of incubation. The tube was spun for 1–2 s using a microcentrifuge and placed on a magnetic rack. The binding solution was discarded. Next, 350 µl of ONT LFB was added to the tube and mixed by gentle tapping 5 times. The tube was then spun for 1–2 s using a microcentrifuge and placed on a magnetic rack. The ONT LFB was removed and discarded. Next, the Nanobind disk was washed by adding 350 µl ONT LFB. The tube was gently tapped 5 times to move LFB over the surface of the disk. The tube was then incubated at room temperature for 5 min. The tube was then spun for 1–2 s using a microcentrifuge and placed on a magnetic rack. The ONT LFB was removed and discarded. The tube was briefly spun using a microcentrifuge to move the Nanobind disk to the bottom of the tube. To elute DNA from the Nanobind disk, 126 µl ONT EB was added to the tube. The tube was incubated for 30 min at room temperature, then gently tapped 5–10 times and incubated for an additional 1–2 h at room temperature. The eluate was then transferred to a new 1.5 ml Eppendorf DNA Lo-Bind tube with a wide-bore P200 pipette, using the same technique described above of passing the eluate over the Nanobind disk before removing it from the tube. The mixture was then pipette mixed 2–3 times with a wide-bore P200 pipette. The library was stored overnight at 4 °C before sequencing to facilitate maximal dissolution of DNA.

Flow cell loading and sequencing

ONT sequencing buffer (SQB) (68 µl) was added to 82 µl of the eluate from above. The mixture was pipette mixed 4 times with a wide-bore P200 pipette set to 150 µl. Each aspiration of 150 µl took 10–20 s, and each dispense of 150 µl took 10–20 s. Samples were then incubated at room temperature for 10 min. Next the samples were pipette mixed 8 times with a wide-bore P200 pipette set to 150 µl. Before loading the library, the flow cell was primed with flush buffer/flush tether mixture per ONT directions. The library was then added to the flow cell. The mixture was viscous but loaded smoothly in about 1 min. Some samples took 2 min maximum to load. The sequencing run had a re-mux time set for every 6 h. Base calling was performed using Guppy (v.4.0.11), with default parameters and the high-accuracy PromethION model (dna_r9.4.1_450bps_hac_prom.cfg).

Dovetail Omni-C

We prepared Omni-C libraries from each cell line using a Dovetail Omni-C kit (Dovetail Genomics) with the following modifications. First, we aliquoted 1 million cells for fixation with formaldehyde and DSG. We digested chromatin with DNase I until DNA fragments of a desired length were obtained. Per the protocol, we performed end repair on the chromatin, followed by ligation of a biotinylated bridge oligonucleotide, followed by ligation of free chromatin ends. We reversed crosslinks and purified proximity ligated DNA. We converted the DNA into an Illumina sequencing library using a NEB Ultra II library preparation kit (NEB) with a Y-adaptor. We enriched for ligation products using streptavidin bead capture on the final library. Each capture reaction was then split into two replicates before the final PCR enrichment step to preserve complexity. All libraries were uniquely dual indexed and sequenced on an Illumina NovaSeq platform with read lengths of 2 × 150 bp.

Phased assembly pipeline

We describe the main automated and manual steps taken before, during and after assembly. A combined set of workflow description language (WDL) formatted assembly workflows is available from Dockstore that captures each of the steps for filtering adapter-containing reads and running Hifiasm (dockstore.org/organizations/HumanPangenome/collections/Hifiasm). All assemblies were generated using this workflow, running on AnVIL71. Cleaning assemblies and fixing some structural issues were performed through a combination of automated workflows and manual curation as described below. Manual curation was performed using Jupyter notebooks available at GitHub (github.com/human-pangenomics/hpp_production_workflows/tree/master/assembly/y1-notebooks).

Filtering adapter-containing reads and running Hifiasm

Before producing the assemblies, we detected and removed the reads containing PacBio adapters using a bash script from the HiFiAdapterFilt repository72 (commit 64d1c7b). This script first creates a database of the PacBio adapter sequences, as illustrated below:

>gnl|uv|NGB00972.1:1-45 Pacific Biosciences Blunt Adapter
ATCTCTCTCTTTTCCTCCTCCTCCGTTGTTGTTGTTGAGAGAGAT
>gnl|uv|NGB00973.1:1-35 Pacific Biosciences C2 Primer
AAAAAAAAAAAAAAAAAATTAACGGAGGAGGAGGA

It then runs blastn with tuned parameters to detect adapter-containing reads as follows:

blastn -db ${DATABASE} -query ${HIFI_FASTA} -task blastn -reward 1 -penalty -5 -gapopen 3 -gapextend 3 -dust no -soft_masking true -evalue 700 -searchsp 1750000000000 -outfmt

For 43 samples (out of 47), we removed less than 0.15% of the reads; all of the 29 HPRC-selected samples are among these 43 samples, indicating the low level of adapter contamination in the HiFi data produced by the HPRC. HG005, which is one of the 18 additional samples, had the highest adapter contamination percentage, at about 1% (Supplementary Fig. 39).

The removed reads were then aligned to the T2T-CHM13 (v.2.0) reference to ensure that there was no chromosomal or locus-specific bias in the filtering process. Supplementary Fig. 40 shows a snapshot of the IGV browser73 illustrating the coverage of the adapter-containing reads along the genome. The locations of the reads were almost evenly distributed along the genome and, excluding centromeres, we found almost no regions covered by more than two adapter-containing reads, even in HG005, which had the highest contamination percentage.

The trio-binning mode of Hifiasm needs haplotype-specific k-mers for trio phasing the assembly graph. To generate these k-mers, we used parental Illumina short reads for the 47 HPRC samples, which are publicly available from the 1KG dataset19. For each parental short-read sample, we used yak count (v.0.1)74 to generate the k-mer hash tables, running it once for each of the paternal and maternal read sets:

yak count -k31 -b37 -o pat.yak paternal.fq.gz
yak count -k31 -b37 -o mat.yak maternal.fq.gz

The adapter-filtered HiFi reads along with the parental k-mer tables were then given to Trio-Hifiasm (v.0.14) to produce haplotype-resolved assembly graphs. Only the sample HG002 was re-assembled with Trio-Hifiasm (v.0.14.1), which is explained in more detail in the next subsection.

hifiasm -o ${SAMPLE_NAME} -t 48 -1 pat.yak -2 mat.yak hifi.fq.gz

Hifiasm produces one graph per haplotype in GFA format. Each haplotype-specific GFA file was then converted to FASTA format using Gfatools75. The assemblies produced by Trio-Hifiasm (v.0.14) were released as v.2 after the three cleaning steps described at the end of this section were performed.

Manually fixing issues

We used paftools.js asmgene, from the minimap2 repository (github.com/lh3/minimap2/tree/master/misc)76 to count the number of apparent gene duplications for each of the assemblies produced by Trio-Hifiasm (v.0.14). Asmgene does not distinguish between true duplicates and errors, but from its results we could identify the overall duplication trend and detect outliers. This assessment acted as a proxy for detecting high-level duplication errors. We used the Ensembl v.99 cDNA sequences77 as the input gene set for running asmgene.

# Aligning genes to GRCh38 and each Hifiasm haploid assembly
minimap2 -cx splice:hq hs38.fa cdna.fa > hs38.paf
minimap2 -cx splice:hq ${pat/mat}.fa cdna.fa > ${pat/mat}.paf
# Detecting gene duplications
paftools.js asmgene -a hs38.paf ${pat/mat}.paf

Three samples were detected as outliers in terms of the number of gene duplications. To identify the cause of this issue, we aligned back the HiFi reads to those assemblies and checked the depth of coverage and mapping qualities. This showed that the samples HG01358, HG01123 and HG002 contained false duplications in at least one haplotype, of lengths around 55 Mb (in contig h1tg000058l), about 14 Mb (h1tg000013l) and around 70 Mb (h2tg000045l), respectively. In the assembly graphs of HG01358 and HG01123, the duplicated HiFi reads that appeared multiple times were used as anchors to manually determine the exact boundaries of the duplicated regions in the contigs. These two contigs were then manually fixed by breaking the contigs at the duplication start and end points and discarding the duplicated sequence from the assembly. In detail, for HG01123 (h1tg000013l), we discarded the interval [94439457, contig end]. For HG01358 (h1tg000058l), we kept the interval [0, 95732608), renaming the contig to h1tg000058l_1, discarded the interval [95732608, 150395342) and kept the interval [150395342, contig end], renaming it to h1tg000058l_2. To address the false duplication in HG002, we re-assembled it using a newer version of Trio-Hifiasm (v.0.14.1), which was reported not to have this problem.
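These fixes amount to cutting out the duplicated interval and renaming the remaining pieces. A minimal Python sketch is shown below, assuming the haplotype assembly has been read into a dict asm of contig names to sequences; it is illustrative only, as the actual fixes were applied in the curation notebooks referenced earlier.

def split_contig(asm, name, keep_intervals):
    # asm: dict of contig name -> sequence, loaded beforehand
    # Replace `name` with one contig per kept half-open interval,
    # discarding everything in between; rename only if more than one piece is kept.
    seq = asm.pop(name)
    pieces = [(start, len(seq) if end is None else end) for start, end in keep_intervals]
    if len(pieces) == 1:
        start, end = pieces[0]
        asm[name] = seq[start:end]
    else:
        for i, (start, end) in enumerate(pieces, start=1):
            asm[f"{name}_{i}"] = seq[start:end]

# HG01123: drop [94439457, contig end] from h1tg000013l
split_contig(asm, "h1tg000013l", [(0, 94439457)])
# HG01358: keep [0, 95732608) and [150395342, contig end) of h1tg000058l
split_contig(asm, "h1tg000058l", [(0, 95732608), (150395342, None)])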

We also evaluated the phasing accuracy of the assemblies using yak trioeval (see below). We detected a single large misjoin in a maternal contig of the HG02080 assembly. It contained an approximately 22-Mb-long paternal block in the middle of the contig, which resulted in two switch errors at the edges of this block. This block was manually discarded from the assembly and the contig was broken into two smaller ones. In detail, for the HG02080 contig h2tg000053l, we kept the interval [0, 41506503), renaming it to h2tg000053l_1, discarded the interval [41506503, 63683095) and kept the interval [63683095, contig end], renaming it to h2tg000053l_2.

We finally searched for interchromosomal misjoins using the Minigraph pangenome (see below for construction details). An ‘interchromosomal misjoin’ was defined by a chimeric Minigraph alignment (see below) consisting of ≥1 Mb subalignments on different chromosomes.

Cleaning steps

To clean the raw assemblies, we performed three additional steps: masking the remaining HiFi adapters, dropping the contigs that were contaminated in their entirety and removing any redundant mitochondrial contigs.

In the first cleaning step, the sequence of the PacBio SMRTbell adapter was aligned to each assembly using minimap2 with the parameters -cxsr -f5000 -N2000 --secondary=yes. We extracted only the hits with less than or equal to 2 mismatches and which were longer than 42 nt. In addition, eukaryotic adapters in each assembly were identified by VecScreen78. The combined minimap2 and VecScreen adapter hits (when present) were hard-masked in the assemblies using a WDL of the bedtools maskfasta command (dockstore.org/workflows/github.com/human-pangenomics/hpp_production_workflows/MaskAssembly:master?tab=info).

bedtools maskfasta \
-fi ${inputFastaFN} \
-bed ~{adapterBed} \
-fo ~{outputFasta}

In the second cleaning step, we used VecScreen to detect mitochondrial contigs and contigs consisting of nonhuman sequences from other organisms, such as bacteria, viruses and fungi. These contigs were then dropped from the assemblies using a WDLized version of samtools faidx. It is worth noting that the contigs with nuclear mitochondrial DNAs within them were not dropped.

samtools faidx \
$inputFastaFN \
`cat contigsToKeep.txt` | gzip \
> ~{outputFasta}

In the last cleaning step, we selected one contig as the best mitochondrial contig per diploid assembly. To do this selection, we aligned the sequence of the mitochondrial DNA (with the RefSeq identifier NC_012920.1) to each diploid assembly using minimap2 with the parameters -cx asm5 --cs. Then we selected the contig with the highest mapping score and the lowest number of mismatches as the best mitochondrial contig (we selected one randomly if multiple best contigs existed). This contig was then rotated and flipped (if necessary) to match the start and orientation of NC_012920.1 and then added to the maternal assembly of the corresponding sample. Only the HG01071 sample did not produce any identifiable mitochondrial contig.
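A minimal Python sketch of this selection step follows, assuming the minimap2 hits have been parsed into dictionaries with the target contig name, an alignment score (AS) and a mismatch count (NM); the field names are assumptions for illustration, and the rotation and flipping step is omitted.

import random

def pick_best_mt_contig(hits):
    # highest mapping score first, then lowest number of mismatches
    best_score = max(h["AS"] for h in hits)
    top = [h for h in hits if h["AS"] == best_score]
    min_nm = min(h["NM"] for h in top)
    top = [h for h in top if h["NM"] == min_nm]
    # pick one at random if several equally good contigs remain
    return random.choice(top)["target"]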

Masked and cleaned assemblies, with their mitochondrial contigs added, were then accessioned to GenBank, where they underwent another round of adapter masking and removal of contamination, which was mostly Epstein–Barr virus used to generate the LCLs. The final cleaned assemblies in GenBank were downloaded, and the contig identifiers were prepended with the sample name and haplotype integer (where 1 = paternal and 2 = maternal). For example, a contig assigned the name JAGYVH010000025 in sample HG02257’s maternal assembly was renamed to be HG02257#2#JAGYVH010000025. The renamed assemblies were then released to our Amazon Simple Storage Service (S3) and Google Cloud Platform (GCP) buckets. In the process of downloading from GenBank, three of the assemblies (HG00733 paternal, HG02630 paternal and NA21309 maternal) had their downloads prematurely stopped, which resulted in missing sequences. Notably, NA21309 is missing its mitochondrial contig. Details can be found on the HPRC’s year 1 assembly GitHub repository (github.com/human-pangenomics/HPP_Year1_Assemblies). The assemblies held at the International Nucleotide Sequence Database Collaboration are not truncated, but the truncated copies were retained in S3 and GCP as they were used in the construction of the pangenomes.
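The renaming convention can be expressed as a one-line helper (Python sketch):

def rename_contig(contig_id, sample, haplotype):
    # haplotype: 1 = paternal, 2 = maternal
    return f"{sample}#{haplotype}#{contig_id}"

rename_contig("JAGYVH010000025", "HG02257", 2)  # 'HG02257#2#JAGYVH010000025'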

After submission to GenBank, the assemblies were aligned against CHM13 using Winnowmap, and multiple contigs were found to be unmapped. These contigs were subjected to BLAST and found to be almost exclusively Epstein–Barr virus sequence. GenBank confirmed (personal communication) that these unmapped contigs should have been dropped as contamination, but since the genomes were already in active use, they elected not to remove them at this time. A list of the contigs that should have been dropped can be found on the year 1 assembly GitHub repository (github.com/human-pangenomics/HPP_Year1_Assemblies/blob/main/genbank_changes/y1_genbank_remaining_potential_contamination.txt).

Assembly assessment pipeline

Several steps in assembly assessment were managed through a StandardQC workflow written in WDL, run on AnVIL and available at Dockstore (dockstore.org/workflows/github.com/human-pangenomics/hpp_production_workflows/StandardQC). Individual tools within the workflow were run in Docker containers with specific tool versions installed for consistency and reproducibility. Details are available within the Dockstore-deposited workflow. The StandardQC workflow takes short-read data for the parental and child samples and the two assembly haplotypes, and produces an analysis of various quality metrics using the tools described below. For each task, the workflow produced a small human-readable summary file, which is also easy to parse for summarizing steps, as well as the full output from the tool for manual inspection. Specific tool invocations can be determined from the deposited workflow and are described in the subsequent sections.

Measuring interchromosomal joins

Contigs were aligned to CHM13 (v.2.0) with Minigraph (v.0.18) and processed with the following command line:

minigraph -cxasm chm13v2.0.fa contigs.fa | paftools.js misjoin -

The ‘misjoin’ command reports an interchromosomal join if a contig has two ≥1 Mb alignments to two different chromosomes.

Assembly contiguity assessment

Assembly contiguity was assessed for each haplotype using QUAST79. These statistics included total sequence assembled, total assembled contigs and contig NG50 (assuming a genome size of 3.1 Gb). All reference-based analyses were skipped.

QUAST was invoked with the following command:

python /opt/quast/quast-5.0.2/quast-lg.py -t 16 -o <sample>.quast --large --est-ref-size 3100000000 --no-icarus

Assembly QV assessment

Assembly QV was determined using two separate k-mer-based tools. The first is Yak20. Yak’s QV estimation happens separately on each haplotype. The k-mer databases for Yak were generated using the following command:

yak count -t16 -b37 -o <sample>.yak <(cat <read_files>) <(cat <read_files>)

QV estimation using Yak was generated with the following command:

yak qv -t 32 -p -K 3.2g -l 100k <sample>.yak
<sample_assembly_haplotype> > <sample>.<haplotype>.yak.qv.txt

Assembly QV was also determined using Meryl and Merqury80. Meryl generates k-mer databases and Merqury determines haplotype QV jointly with both haplotypes.

The k-mer databases with Meryl were generated with the following commands. Databases were generated separately for each read file using meryl count and merged with meryl union-sum. Parental-specific k-mers (hap-mers) were generated using Merqury’s hapmers.sh script.

meryl k=21 threads=64 memory=32 count output <sample>.meryl <read_file>
meryl union-sum output <sample>.meryl <sample_read_meryl_files>
bash hapmers.sh maternal.meryl paternal.meryl sample.meryl

QV estimation using Merqury was generated with the following command:

merqury.sh sample.meryl maternal.meryl paternal.meryl
<maternal_haplotype> <paternal_haplotype> <sample>.merqury

GIAB-based assembly quality assessment

As a complementary and stratified assessment of assembly quality, we used the GIAB assembly benchmarking pipeline to compare assembly-based variant calls to GIAB’s small variant benchmarks (v.4.2.1) for two GIAB samples assembled in this work: HG002 and HG005. We evaluated the HG002 and HG005 HPRC assemblies aligned to GRCh38. Variants were called from assemblies using Dipcall (v.0.3) (using minimap2 (v.2.2.4))38. We used the -z200000,10000 parameter to improve alignment contiguity, as it was previously shown to improve variant recall in regions with dense variation, such as the major histocompatibility complex81. Small variant evaluation was performed using hap.py (v.3.15)82, benchmarking against v.4.2.1 of high-confidence SNP, small indel and homozygous reference calls for the GIAB samples HG002 and HG005. Comparisons were performed with and without restriction to the associated dipcall region file (dip.bed) to assess recall within and outside assembled regions. For better comparisons of complex variants, hap.py was run using vcfeval83. Variant calls were stratified using GIAB stratifications (v.3.0)84, stratifying true-positive, false-positive and false-negative variant calls in challenging and targeted regions of the genome.

Trio-based assembly phasing assessment

Assembly phasing was assessed using Yak and described using two statistics: switch error and Hamming error rates. Switch error describes the number of times two adjacent phased variants incorrectly switch between maternal and paternal haplotypes. Hamming error rate relates to the total number of misphased variants per assembled contig. Yak generates phasing statistics separately for each haplotype using parental k-mers gathered from Illumina short-read sequencing of the parents.
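As an illustration of the two metrics, the Python sketch below reduces each contig to a list of parental-origin calls ('P' or 'M') at its phased heterozygous sites; this is a simplification for illustration, not yak's k-mer-based implementation.

def switch_error_rate(origins):
    # count adjacent pairs of phased sites that switch between parents
    switches = sum(1 for a, b in zip(origins, origins[1:]) if a != b)
    return switches / max(len(origins) - 1, 1)

def hamming_error_rate(origins):
    # misphased sites relative to the best single parental assignment of the contig
    misphased = min(origins.count("P"), origins.count("M"))
    return misphased / max(len(origins), 1)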

Yak generates k-mer databases for the sample and both parental haplotypes (as described above). We used Yak to generate phasing metrics with the following command:

yak trioeval -t 32 paternal.yak maternal.yak <haplotype_assembly> > <sample>.<haplotype>.yak_phasing.txt

Hi-C-based assembly phasing assessment

An alternative approach for phasing evaluation uses Hi-C reads and does not require trio information. We computed the switch error rate for local phasing evaluation and the Hamming error rate for global phasing evaluation. We implemented an efficient k-mer-based method in pstools (v.0.1)24 and used maximum Hi-C read support to detect switch errors at heterozygous positions. In this procedure, we first identified heterozygous k-mers (hets) from phased assemblies using 31-mers. Then we mapped Hi-C reads to the assemblies using these 31-mers. If there were >5 reads that supported a switch between consecutive hets in assemblies, we considered it a haplotype switch. For each het pair, we noted whether Hi-C reads supported or did not support the phase. We considered a switch error when a het site had its phase support switched relative to that of the previous heterozygous site. The switch error rate is the number of local switches divided by the number of heterozygous sites. We performed this operation over the entire contig, across all contigs, for the switch calculations. For the Hamming error calculations, we used the Hamming distance at the whole-contig level divided by the number of heterozygous sites. This measure gives a global view of phasing errors and implicitly penalizes any long switches in contigs.

Assembly read-based evaluation of Flagger

The following describes the generation and cleaning of the HiFi alignments to the HPRC assemblies and running Flagger (v.0.1), a read-based pipeline for evaluating diploid/dual assemblies. All the WDL-based workflows for running these steps are deposited in the Dockstore collection (dockstore.org/organizations/HumanPangenome/collections/Flagger-Secphase).

Preparing the HiFi alignments

We aligned back the HiFi reads of each sample to its diploid assembly. The alignments were produced with winnowmap (v.2.03) using the following commands:

# making the k-mer table with meryl
meryl count k=15 output merylDB asm.fa
meryl print greater-than distinct=0.9998 merylDB > repetitive_k15.txt
# alignment with winnowmap
winnowmap -W repetitive_k15.txt -ax map-pb -Y -L --eqx --cs -I8g <(cat pat_asm.fa mat_asm.fa) reads.fq.gz | samtools view -hb > read_alignment.bam

For all samples, we used the full HiFi read sets mentioned in Supplementary Table 1, except HG002, for which we downsampled the read set to 35×.

To exclude unreliable alignments, we removed all chimeric alignments and alignments shorter than 2 kb or with a gap-compressed mismatch ratio higher than 1%. As the assembly is diploid and the reads aligned to the homozygous regions are expected to have low mapping qualities, we did not filter alignments on the basis of their mapping qualities. In Supplementary Fig. 41, we plot the histograms of mapping qualities and the distributions of alignment identities for one sample, HG00438, as an example. The statistics of three sets of alignments were plotted: the alignments to the diploid assembly and to each haploid assembly (maternal and paternal) separately. These plots indicate that the reads have higher identities when the diploid assembly is used as the reference, but that about 20% more reads have mapping qualities lower than 10.
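A sketch of these filters using pysam is shown below; taking the gap-compressed divergence from the aligner's de tag and detecting chimeric reads through the SA tag are assumptions made for illustration, not necessarily how the production workflow implements them.

import pysam

def keep_alignment(aln):
    if aln.is_unmapped or aln.is_secondary or aln.is_supplementary:
        return False
    if aln.has_tag("SA"):                      # part of a chimeric alignment
        return False
    if aln.query_alignment_length < 2000:      # shorter than 2 kb
        return False
    # gap-compressed mismatch ratio must not exceed 1%
    return aln.has_tag("de") and aln.get_tag("de") <= 0.01

with pysam.AlignmentFile("read_alignment.bam", "rb") as bam, \
        pysam.AlignmentFile("filtered.bam", "wb", template=bam) as out:
    for aln in bam:
        if keep_alignment(aln):
            out.write(aln)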

Generally, in highly homozygous regions, the aligner may not be able to select the correct haplotype as the primary alignment because of either read errors or misassemblies. To detect these cases, we searched for secondary alignments for which the scores were almost as high as the primary alignment of the same read. For each such read, we made a pseudo-multiple alignment of the read sequence and the assembly blocks captured by all secondary and primary alignments. Using this alignment, we searched for the read bases that were mismatched in at least one alignment but not all alignments. We called such bases single nucleotide markers. For each alignment, we calculated a consistency score by considering only the single nucleotide markers and taking the summation of their base qualities with a negative sign. We then sorted the alignments (regardless of being primary or secondary) based on this score. If the best alignment was a secondary alignment, we assigned the primary tag to this alignment and removed the other alignments. The percentage of the total reads with swapped alignments ranged from 0.03% (HG03453) to 0.44% (HG005) across 47 HPRC samples. This result shows that only a small percentage of the reads needed to be relocalized using this method. This step was performed through the Secphase (v.0.1) workflow, which is available in the Dockstore collection (dockstore.org/organizations/HumanPangenome/collections/Flagger-Secphase).
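The consistency score can be sketched as follows; the data structures are abstractions for illustration, and the real implementation is in the Secphase workflow linked above.

def consistency_score(mismatch_positions, markers, base_qual):
    # negative sum of base qualities at single nucleotide markers where
    # this alignment mismatches; markers are read positions that mismatch
    # in at least one (but not all) of the candidate alignments
    return -sum(base_qual[pos] for pos in markers if pos in mismatch_positions)

def choose_primary(alignments, markers, base_qual):
    # alignments: {alignment name: set of mismatching marker positions};
    # the best-scoring alignment becomes the primary, whether or not it was before
    return max(alignments,
               key=lambda name: consistency_score(alignments[name], markers, base_qual))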

By calling variants, it is possible to detect the regions that either need polishing (that is, are errors) or that have alignments from the wrong haplotype because of mismappings. We used DeepVariant (v.1.3.0) with the parameter --model_type="PACBIO" to call variants on these alignments. The variants were then filtered to include only the biallelic SNPs with a variant frequency higher than 0.3 and genotype quality higher than 10.

bcftools view -Ov -f PASS -m2 -M2 -v snps -e 'FORMAT/VAF < 0.3 || FORMAT/GQ < 10' ${OUTPUT_VCF} > ${SNPS_VCF}

Having the biallelic SNPs, we found the alignments with alternative alleles and removed them from the bam file. For this step, we implemented and used the program filter_alt_reads, running the following command:

filter_alt_reads -i ${INPUT_BAM} -o ${ALT_FILTERED_BAM} -f ${ALT_BAM} -v ${SNPS_VCF}

Running the evaluation pipeline

To assess the read mappings resulting from our diploid alignment process, we used the following five steps, which are combined into a pipeline that we refer to as Flagger. Flagger essentially fits a mixture model to successive coverage blocks of the read-to-diploid assembly alignment and then classifies each block to a category predicting the accuracy of the assembly at that location.

Step 1: calculating depth of coverage

First, after producing and cleaning the HiFi alignments, we calculated the depth of coverage for each assembly base by samtools depth -aa (the -aa option allows outputting the bases with zero coverage):

samtools depth -aa -Q 0 read_alignment.bam > read_alignment.depth

The output of samtools depth was then converted into a more compact format with the .cov suffix. This format is implemented specifically for Flagger and is more efficient because consecutive bases with the same coverage take only one line. We implemented a program called depth2cov for converting the output of samtools depth to the .cov format.

depth2cov -d read_alignment.depth -f asm.fa.fai -o read_alignment.cov
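This conversion is essentially a run-length encoding of the per-base depths; the Python sketch below illustrates the idea, although the exact field layout of the .cov format may differ.

def depth_to_cov(depth_lines):
    # collapse consecutive positions of a contig that share the same coverage
    run = None  # (contig, run_start, run_end, coverage)
    for line in depth_lines:
        contig, pos, cov = line.split()
        pos, cov = int(pos), int(cov)
        if run and run[0] == contig and run[3] == cov and pos == run[2] + 1:
            run = (contig, run[1], pos, cov)
        else:
            if run:
                yield run
            run = (contig, pos, pos, cov)
    if run:
        yield run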

Step 2: fitting the mixture model

In the second step, the frequencies of coverages were calculated using cov2counts. The output file with the .count suffix is a two-column tab-delimited file: the first column shows coverages and the second column shows the frequencies of those coverages.

cov2counts -i read_alignment.cov -o read_alignment.counts

The python script fit_gmm.py takes a file with the .counts suffix, fits a Gaussian mixture model and finds the best parameters through expectation-maximization. This mixture model consists of four main components, and each component represents a specific type of region:

(1) Erroneous component, which is modelled by a Poisson distribution. To avoid overfitting, this component only uses coverages below 10, so its mean is limited to be between 0 and 10. It represents the regions with very low read support.

(2) (Falsely) duplicated component, which is modelled by a Gaussian distribution whose mean is constrained to be half of the mean of the haploid component. It should mainly represent the falsely duplicated regions.

(3) Haploid component, which is modelled by a Gaussian distribution. It represents blocks with the coverages that we expect for the blocks of an error-free assembly.

(4) Collapsed component, which is actually a set of components, each of which follows a Gaussian distribution whose mean is a multiple of the mean of the haploid component. It represents regions that have additional copies present in the underlying genome that have been ‘collapsed’ into a single copy.

It was noted that the model components may change for different regions owing to regional coverage differences and that the resulting systematic differences affect the accuracy of the partitioning process. To make the coverage thresholds more sensitive to local patterns, the diploid assembly was split into windows (5–10 Mb in length) and a distinct model was fit for each window. Before fitting, we split the whole-genome coverage file produced in the first step into multiple coverage files, one for each window. We implemented and ran split_cov_by_window for splitting:

split_cov_by_window -c read_alignment.cov -f asm.fa.fai -s 5000000 -p ${OUTPUT_PREFIX}

This produced a list of coverage files, each of which ends with ${CONTIG_NAME}_${WINDOW_START}_${WINDOW_END}.cov

We then repeated the above-described steps for each resulting coverage file.

One important observation is that for short contigs, the coverage distribution is generally too noisy to satisfactorily fit the mixture model. To address this issue, we performed the window-specific coverage analysis only for the contigs longer than 5 Mb. For the shorter contigs, we used the results of the whole-genome analysis.

Step 3: extracting blocks of each component

Using the fitted model, we assigned each coverage value to one of the four components (erroneous, duplicated, haploid and collapsed). To do so, for each coverage value we picked the component with the highest probability. For example, the coverage value 0 is frequently assigned to the erroneous component. In Supplementary Fig. 42, the coverage intervals are coloured based on their assigned component.
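A minimal sketch of this assignment, assuming the fitted parameters are at hand (err_lambda, hap_mean and hap_sd are placeholder names, and the mixture weights are omitted for brevity):

from scipy.stats import norm, poisson

def assign_component(cov, err_lambda, hap_mean, hap_sd, max_copies=4):
    likelihoods = {
        "erroneous":  poisson.pmf(cov, err_lambda),
        "duplicated": norm.pdf(cov, hap_mean / 2, hap_sd),
        "haploid":    norm.pdf(cov, hap_mean, hap_sd),
        # collapsed: Gaussians whose means are multiples of the haploid mean
        "collapsed":  max(norm.pdf(cov, k * hap_mean, hap_sd)
                          for k in range(2, max_copies + 1)),
    }
    return max(likelihoods, key=likelihoods.get)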

Step 4: incorporating coverage biases in HSats

According to an article describing a complete human genome3, there are some satellite arrays (for example, HSat1, HSat2 and HSat3) for which the HiFi coverage drops or increases systematically owing to biases in sample preparation and sequencing. Such platform-specific biases mislead the pipeline. As a result, the falsely duplicated component may contain a mixture of falsely duplicated and coverage-biased blocks. Similar effects occur for the collapsed component.

To incorporate such coverage biases and to correct the results in the corresponding regions, we first found the regions of each haploid assembly for which a coverage bias is expected. To find such regions, we lifted over the CHM13 HSat1, HSat2 and HSat3 annotation to each assembly by aligning the assembly contigs to the reference T2T-CHM13 (v.1.1) and GRCh38 (chromosome Y) and projecting the HSat coordinates back to the assembly (using the python script project_blocks.py). Then we ran fit_gmm.py to fit a mixture model for the blocks assigned to each HSat type and adjusted the parameter --coverage, the starting point of the expectation-maximization process, based on the expected coverage in the corresponding HSat. For HSat1, HSat2 and HSat3 we set --coverage to 0.75, 1.25 and 1.25 times the average sequencing coverage, respectively. Finally, we decomposed each HSat based on the inferred coverage thresholds and replaced the previously assigned component with the new one.

Step 5: using high-quality alignments to correct spurious flags

In some cases, the duplicated component was mixed up with the haploid one. This usually happened when the coverage in the haploid component dropped systematically or when the majority of a long contig was falsely duplicated. To address this issue, we used another indicator of a false duplication, which is the accumulation of alignments with very low mapping quality (MAPQ). We produced another coverage file using only the alignments with MAPQ > 20. Whenever we found a region flagged as duplicated with more than five high-quality alignments, we changed the flag to haploid.

After the correction made in step 5, we merged blocks from each component that were closer than 1,000 bp, and the overlap of any two components after merging was flagged as ‘unknown’ to show that this block could not be properly assigned. The BED files produced by Flagger are available in the HPRC S3 bucket (s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=submissions/e9ad8022-1b30-11ec-ab04-0a13c5208311–COVERAGE_ANALYSIS_Y1_GENBANK/FLAGGER/APR_08_2022/FINAL_HIFI_BASED/FLAGGER_HIFI_ASM_SIMPLIFIED_BEDS/).

Assessing T2T-CHM13 using Flagger

To estimate the false-positive rate of Flagger, we applied it to the T2T-CHM13 (v.1.1) reference. The direct output of Flagger showed that about 12.77 Mb (around 0.41%) of the T2T-CHM13 reference assembly was flagged as potentially unreliable. The HPRC assemblies were almost free of rDNA arrays, but there were modelled sequences for rDNA arrays in the T2T-CHM13 (v.1.1) reference. These arrays were flagged as falsely duplicated in their entirety, which indicated that Flagger with HiFi reads may not be able to correctly evaluate rDNA arrays. Therefore, to make a fair comparison, we excluded rDNA arrays (about 9.92 Mb in total) from the reference evaluation, which decreased the number of unreliable bases to 5.58 Mb (around 0.18%). We additionally identified a region of about 2.76 Mb adjacent to the HSat2 array on chromosome 1 that was mis-flagged as collapsed. This mis-flagging was caused by the systematic coverage increase in the neighbouring HSat2, which altered the fitted mixture model. After manually correcting this mis-flagging, about 2.82 Mb (0.09%) of T2T-CHM13 (v.1.1) remained flagged as unreliable. This number is about 9.3 times lower than the average for the HPRC assemblies. These unreliable blocks are mainly a combination of ‘unknown’ blocks, which could not be properly assigned, and regions with HiFi-specific coverage drops. The results of this analysis are available in the HPRC S3 bucket (s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=submissions/e9ad8022-1b30-11ec-ab04-0a13c5208311–COVERAGE_ANALYSIS_Y1_GENBANK/FLAGGER/APR_08_2022/FINAL_HIFI_BASED/T2T-CHM13/).

Repeat masking

Repeat masking on each assembly was performed iteratively using RepeatMasker (v.4.1.2-p1). The first step used the default human repeat library, and the second step ran on the original assemblies, after hard-masking the repeats found in the first step, using a repeat library augmented with CHM13 satellite DNA sequences. The augmented repeat library (final_consensi_gap_nohsat_teucer.embl.txt) is available at Zenodo (doi.org/10.5281/zenodo.5537107), and a parallelized repeat masking pipeline (RepeatMaskGenome.snakefile) is available at GitHub (github.com/chaissonlab/segdupannotation). The union of the two steps generated the complete repeat masking.

SD annotation

SDs were annotated using sedef85 after masking repeats in each assembly. Repeats annotated with more than 20 copies corresponded to unannotated mobile elements and were excluded from the analysis. The pipeline for annotating SDs is available at GitHub (github.com/ChaissonLab/SegDupAnnotation/releases/tag/vHPRC).

SD reliability

The reliable and unreliable regions for all haplotype assemblies were aligned to T2T-CHM13 (v.2.0) and then subdivided into 5 kb windows and intersected with the SD annotations for T2T-CHM13. SD annotations were unavailable for chromosome Y on T2T-CHM13 (v.2.0) at the time of analysis. Furthermore, the chromosome Y added to T2T-CHM13 is from the HPRC HG002 sample; therefore chromosome Y was excluded. For each class of unreliable region (unknown, erroneous, duplicated, collapsed and haploid), we calculated the average number of base pairs overlapping SDs across the haplotype assemblies and annotated each 5 kb window with the most representative overlapping SD (the SD with the highest product of identity and length). Then, using the most representative SD, we calculated the average length and identity of SDs overlapping each class of unreliable region for all 94 haplotypes and compared the length and identity of SDs that overlapped the different types of errors in the assembly. The code for this analysis is available on GitHub (gist.github.com/mrvollger/3bdd2d34f312932c12917a4379a55973).

Ensembl mapping pipeline for annotation

To create high-confidence annotations, a new Ensembl annotation pipeline was developed. The pipeline clusters and maps spatially proximal genes in parallel (to help avoid issues with individually mapping near-identical paralogues) and attempts to resolve inconsistent mappings by considering both the synteny of the gene neighbourhood in relation to the GRCh38 annotation and the identity and coverage of the underlying mappings.

A reference gene set was created from a subset of the GENCODE (v.38) genes29, which was mapped to the HPRC assemblies through a two-pass alignment process. This subset excluded readthrough genes and genes on patches or haplotypes, and only included one copy of the genes in the X/Y PAR region (only one copy, on chromosome X, is modelled in the Ensembl representation of the PAR genes).

First, to minimize the difficulty of mapping near identical paralogues, a jumping window of 100 kb in length was used to identify clusters of genes to map in parallel (Supplementary Fig. 43). The initial window was positioned at the start of the most 5′ gene for each chromosome in the GRCh38 reference and extended 100 kb from the start of the gene. Any genes fully or partially overlapping the window were then included in the cluster. The next 3′ gene that did not overlap the previous window was then identified, and a new window was created and the process repeated. This resulted in both clustered genes and non-clustered genes (genes were considered not clustered when there was only one gene within the window). The regions to map were then identified on the basis of the start of the most 5′ gene and the end of the most 3′ gene in each cluster (or simply the 5′ and 3′ end of the gene in the case of non-clustered genes).
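A Python sketch of this jumping-window clustering, assuming genes is a per-chromosome list of (name, start, end) tuples sorted by start coordinate (illustrative only):

def cluster_genes(genes, window=100_000):
    clusters, i = [], 0
    while i < len(genes):
        window_end = genes[i][1] + window          # window starts at the most 5' gene
        cluster = []
        for name, start, end in genes[i:]:
            if start < window_end:                 # fully or partially overlapping the window
                cluster.append((name, start, end))
            else:
                break
        clusters.append(cluster)
        i += len(cluster)
    return clusters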

For each region defined in the previous step, anchor points were then selected to help map the region on the target genome. Two 10 kb anchor points were created 5 kb from the 5′ and 3′ edges of the region, and a central 10 kb anchor was created around the midpoint of the region in the GRCh38 genome. The sequences of these anchors were then mapped against the target genome using minimap2 (ref. 86) with the following command:

minimap2 --cs --secondary=yes -N 10 -x map-ont [genome_index] [anchor_file] > [alignment_file]

The resulting hits were examined to determine high-confidence regions in the target genome. High-confidence regions were ones in which all three anchors were on the same top-level sequence, in colinear order, with ≥99% sequence identity and ≥50% hit coverage, and with a similar distance between the anchors when compared with the reference genome. If no suitable candidate region was found with all three anchors, pairs of mapped anchors were then assessed in a similar manner.
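The anchor-consistency checks can be sketched as follows; the hit fields and the 25% spacing tolerance are illustrative assumptions rather than the pipeline's exact criteria for a 'similar distance'.

def high_confidence_region(hit5, hitm, hit3, ref_gap5m, ref_gapm3, tol=0.25):
    # forward-strand case shown for simplicity
    hits = (hit5, hitm, hit3)
    same_sequence = hit5["target"] == hitm["target"] == hit3["target"]
    colinear = hit5["start"] < hitm["start"] < hit3["start"]
    well_aligned = all(h["identity"] >= 0.99 and h["coverage"] >= 0.50 for h in hits)
    similar_spacing = (
        abs((hitm["start"] - hit5["start"]) - ref_gap5m) <= tol * ref_gap5m and
        abs((hit3["start"] - hitm["start"]) - ref_gapm3) <= tol * ref_gapm3)
    return same_sequence and colinear and well_aligned and similar_spacing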

The sequence of the selected region or regions was then retrieved and aligned against the corresponding GRCh38 region using MAFFT. For each gene, the corresponding exons were retrieved and the coordinates were projected through the alignment of the two regions. Transcripts were then reconstructed from the projected exons. For each transcript, the coverage and identity when aligned to the parent transcript from GRCh38 were calculated.

If the resulting transcript had either a coverage of <98% or an identity of <99%, the parent transcripts were aligned to the target region using minimap2 in splice-aware mode, with the high-quality setting for Iso-Seq/cDNA transcripts enabled. The maximum intron size was set to 100 kb by default. For transcripts with reference introns larger than 100 kb, the maximum intron size was scaled and set as 1.5 times the length of the longest intron (to allow some variability):

minimap2 --cs --secondary=no -G [max_intron_size] -ax splice:hq -u b [expected_target_region] [transcript_sequences] > [sam_file]

For each transcript that mapped to the target genome, the quality of the mapping was assessed on the basis of aligning the original reference sequence with the newly identified target sequence. Again, if the coverage or identity of the aligned sequence was <98% or <99%, respectively, the reference transcript sequence was re-aligned to the target region, this time using Exonerate87. Exonerate, although slower than minimap2, has the ability to handle very small exons and can incorporate CDS data to preserve the CDS (introducing pseudo-introns as needed). The following command was used:

exonerate -options --model cdna2genome --forwardcoordinates FALSE --softmasktarget TRUE --exhaustive FALSE --score 500 --saturatethreshold 100 --dnawordlen 15 --codonwordlen 15 --dnahspthreshold 60 --bestn 1 --maxintron [max_intron_size] -coverage_by_aligned 1 --querytype dna --targettype [target_type] --query [query_file] --target [target_file] --annotation [annotation_file] > [output_file]

When more than one approach was used to model the transcript, the mapping with the highest combined identity and coverage was selected.

For genes not mapped through the initial regional anchors, a second approach was used. The expected location of the gene was inferred from high-confidence genes mapped during the first phase. High-confidence mappings were those for which there was a single mapped copy of the gene, all the transcripts had mapping scores of 99% coverage and identity on average and the gene also had a similar gene neighbourhood to the neighbourhood in the reference (at least 80% of the same genes in common for the 100 closest neighbouring genes in the reference). After this step, the entire genome region underlying the missing gene, including a 5 kb flanking sequence, was mapped against the target genome using minimap2:

minimap2 --cs --secondary=yes -x map-ont [genome_index] [gene_genomic_sequence] > [alignment_file]

The resulting hits were then filtered on the basis of overlap with the expected region that the missing gene should lie in. If there was no expected region calculated (cases in which no pair of high-confidence genes could be found to define the 5′ and 3′ boundaries of the expected location of the missing gene, for example, at the edge of a scaffold) or no hit overlapping the expected region was found, the top reported hit was used provided it passed an identity cut-off of 99%. The selected hit or hits were then extended on the basis of how much of the original reference gene they covered to ensure that minor local variants between the reference and target regions did not lead to the target region being truncated. Once extended, the remaining hits were then clustered on the basis of genomic overlap and merged into unique regions. We then attempted to map the missing genes to these regions using the same process described above for the initial mappings, involving MAFFT, minimap2 and Exonerate.

To minimize the occurrences of mis-mapped paralogues, each gene was checked for exon overlap in both the target and the reference. If the overlapping genes were not identical at a locus between the reference and the target, then a conflict was identified. For each gene present, filtering was done to reduce or remove the conflict based on a number of factors, including whether the genes were in the expected location, whether the genes were high-confidence mappings, the average percent identity and coverage of the transcript for the genes and the neighbourhood score. When it was not possible to resolve a conflict between two genes, both were kept. This concluded the primary mapping process.

After this process, potential recent duplications were identified. To search for recent duplications, the canonical transcript of each gene (the longest transcript in the case of noncoding genes, or the transcript with the longest translation followed by the longest overall sequence for protein-coding genes) was selected and aligned across the genome using minimap2 in a splice-aware manner:

minimap2 --cs --secondary=no -G [max_intron_size] -ax splice:hq -u b [genome_index] [input_file] > [sam_file]

Mappings that had exon overlap with existing annotations from the primary mapping process on the target genome were removed. For new mappings that did not overlap existing annotations, the quality of the alignment was then assessed by aligning the mapped transcript sequence to the corresponding reference transcript to calculate the coverage and per cent identity of the mapping. Different coverage and per cent identity cut-offs were used for these mappings on the basis of the type of transcript mapped. Protein-coding and small noncoding transcripts used a coverage and identity cut-off of 95%, whereas long noncoding transcripts used a coverage and identity cut-off of 90%. Pseudogene transcripts had a lower coverage cut-off of 80%, but the same identity cut-off of 90% as long noncoding transcripts.

When looking for new paralogues, for cases in which multiple canonical transcripts mapped to a locus, a single representative transcript was selected. This was based on the following hierarchy of gene biotype groups: coding, long noncoding, pseudogene, small noncoding, and miscellaneous or undefined.

If there were multiple transcripts for the highest represented group, the transcript with the longest sequence was selected as the representative.

Gene annotation quality analysis

Frameshifts

For the Ensembl and CAT gene annotation sets, we identified the locations of frameshifting indels by iterating over the coding sequence of each transcript and looking for any gaps in the alignment. If the gap had a length that was not a multiple of 3, and its length was <30 bp (to remove probable introns from consideration), the gap was determined to be a frameshift and its location saved to a BED file.

Nonsense mutations

We also analysed the number of nonsense mutations that would cause early stop codons in both the Ensembl and CAT gene annotation sets. We identified nonsense mutations by iterating through each codon in the coding sequence of the predicted transcripts. If there was an early stop codon before the canonical stop codon at the end of the transcript, we saved the location in a BED file.

Validation of mutations using Illumina

For both sets of mutations, we then lifted over the coordinates of the mutations to be on the GRCh38 reference so that we could use existing variant call sets on GRCh38. We used halLiftover to lift over each set of coordinates, using the GRCh38-based HAL file from the MC alignment. Then we used bedtools intersect to intersect with the variant call file for each of the assemblies.

The following sample commands were used:

halLiftover GRCh38-f1g-90-mc-aug11.hal <GENOME_NAME> <MUTATION_BED_FILE> GRCh38 <LIFTED_OVER_BED_FILE>
bedtools intersect -wo -a <LIFTED_OVER_BED_FILE> -b <SAMPLE_MERGED_VCF> > <OVERLAP_OUTPUT_TXT_FILE>

The VCF files used in this intersection were downloaded from the 1KG (ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_raw_GT_with_annot/20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr$i.recalibrated_variants.vcf.gz).

Here, $i was replaced with each chromosome number. Each chromosome VCF was then split so that each sample was in its own file using bcftools view, and the per-chromosome files for each sample were combined into one VCF using bcftools concat.
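As an illustration, the per-sample extraction and concatenation could be run as follows; this is a hedged sketch rather than the exact commands used, and the sample placeholder, output file names and chromosome list are assumptions:

# extract one sample from each chromosome VCF, then concatenate the per-chromosome files for that sample
for i in $(seq 1 22); do
  bcftools view -s <SAMPLE> -Oz -o <SAMPLE>.chr${i}.vcf.gz 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr${i}.recalibrated_variants.vcf.gz
done
bcftools concat -Oz -o <SAMPLE>.vcf.gz <SAMPLE>.chr{1..22}.vcf.gz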

Gene duplication analysis

Duplicated genes were detected as multi-mapped coding sequences using Liftoff88 supplemented by a complementary approach (gb-map) with multi-mapped gene bodies. The combined set was formed by including all Liftoff gene duplications and duplicated genes detected by gb-map.

Liftoff

We ran Liftoff (v.1.6.3) to annotate extra gene copies in each of the assemblies. Liftoff was run with the flag -sc 0.90 to find additional copies of genes with an identity of at least 90%. An example command is below:

liftoff -p 10 -sc 0.90 -copies -db <GENCODE_V38_DATABASE> -u <UNMAPPED_FILE> -o <OUTPUT_GFF3> -polish <GENOME_FASTA> <GRCh38_FASTA>

The additional copies of the genes were identified as such in the output gff3 with the field extra_copy_number (equal to anything other than 0). For this analysis, we also only considered genes that were multi-exon, protein-coding genes. The additional gene copies were further filtered to remove any genes outside the ‘reliable’ haploid regions as determined by the Flagger pipeline.

gb-map

The gene-body mapping pipeline identifies duplicated genes by first aligning transcripts of protein-coding and pseudogenes (GENCODE v.38) to each assembly and then multi-mapping the genomic sequences of each corresponding gene. Alignments of at least 90% identity and 90% of the length of the original duplication were considered candidate duplicated genes. Candidates were removed if they overlapped previously mapped transcripts from other genes, low-quality duplications and genes identified through CAT and Liftoff analysis.

Gene family analysis

To account for gene duplications in high-identity gene families, gene families were identified on the basis of sequence alignments from gb-map. Genes that mapped reciprocally with 90% identity and 90% length were considered a gene family. A single gene was selected as the representative gene for the family, and any gene duplication in the family was counted towards that gene.

Pangenome graph construction

Minigraph

Minigraph can rapidly perform assembly-to-graph mappings using a generalization of the minimap2 algorithm34. New SVs of at least 50 bp detected in the mapping can then be added to the graph. To construct a pangenome graph, one chosen reference assembly, GRCh38 in this case, was used as a starting graph, and the mapping and SV addition steps were repeated for each additional assembly, greedily. This iterative approach is analogous to partial order alignment (POA)89. Graphs constructed in this way describe the structural variation within the samples and provide a coordinate system across the reference and all insertions. Minigraph does not produce self-alignments. That is, it will never align a portion of the reference assembly onto another portion of the reference assembly. In this way all reference positions have a unique location within the created pangenome. Minigraph (v.0.14) was used with -xggs options. The input order was GRCh38, CHM13 then the remainder in lexicographic order by sample name.

MC

Graphs constructed by Minigraph only contain structural variation (≥50 bp) by default. The aim of the MC pipeline is to refine the output of Minigraph to include smaller variants, down to the SNP level. Doing so allows the graph to comprehensively represent most variation, as well as to embed the input haplotypes within it as paths, which is important for some applications49. To remove noisy alignments from the MC pangenome, long (≥100 kb) non-reference sequences identified as being satellite, unassignable to a reference chromosome or which appear unaligned to the remainder of the assemblies were removed from the graph. This resulted in a pangenome with significantly reduced complexity that nevertheless maintained all sequences of the starting reference assembly and the large majority of those in the additional haplotypes. The MC pipeline is composed of the five steps described below and in more detail in ref. 35. The script and commands to reproduce this process can be found at GitHub (github.com/ComparativeGenomicsToolkit/cactus/blob/81903cb82ae80da342515109cdee5a85b2fde625/doc/pangenome.md#hprc-version-10-graphs). A newer, simpler version of the pipeline that no longer requires satellite masking can be found at GitHub (github.com/ComparativeGenomicsToolkit/cactus/blob/5fed950471f04e9892bb90531e8f63be911857e1/doc/pangenome.md#hprc-graph).

Paths from the reference, GRCh38, are acyclic in the MC graph. Paths from any other haplotypes can contain cycles (as a result of different query segments mapping to the same target), but they are relatively rare.

(1) Satellite masking: Minigraph is unable to map through highly repetitive sequences such as centromeres and telomeres and, as these regions are also enriched for misassemblies (see the section ‘Assembly assessment’ of Results), we decided to exclude them from the MC graphs used in this work. dna-brnn is a tool that uses a recurrent neural network to quickly identify alpha satellites as well as human satellites 1 and 2 (ref. 90). We ran it with its default parameters on all input sequences and cut out any identified regions ≥100 kb, except on the reference. The three satellite families that dna-brnn detects account for the majority of satellite sequences, but not all. As such, gaps ≥100 kb in Minigraph mappings were also removed. They were detected by mapping each assembly, after having removed the dna-brnn regions, to the Minigraph (using the procedure described below). Overall, an average of 188.6 Mb of sequence from each (non-reference) assembly was excluded from the graph.

(2) Assembly-to-graph mapping: Minigraph generalizes the fast seeding and chaining algorithms of minimap2, but it does not currently produce exact alignments in cigar strings or otherwise. For this work, an option, --write-mz, was added to report chains of minimizers, which in this case are 15 bp exact matches, and all assemblies were mapped to the Minigraph graph using it. The resulting minimizers were then converted into PAF files with cigars representing exact pairwise alignments between the query contigs and Minigraph node sequences, and all mappings with MAPQ < 5 were excluded, as were overlapping query regions >10 kb.

(3) Chromosome decomposition: the Minigraph graphs do not contain inter-chromosomal rearrangements, but the mappings performed in the previous steps can imply them. That is, a contig can partially map to multiple chromosomes. In most cases, these mappings involve similarity across different acrocentric short arms. To avoid introducing misleading interchromosomal events, and because it is necessary to run the subsequent steps individually on chromosomes owing to memory requirements, the mappings were divided by reference chromosome. This was done by splitting the Minigraph into connected components and using the RN tags to determine their corresponding chromosome names. The PAF mappings were used to determine the coverage of each query contig with each chromosome component. This coverage was used to assign each query contig to a single chromosome by choosing the chromosome with the highest coverage. Contigs with insufficient coverage to any chromosome (<90% for contigs with lengths in [1, 10 kb); <80% for [10 kb, 100 kb); <75% for [100 kb, 1 Mb); and <70% for ≥1 Mb) were considered ambiguous and not included in the graph. In the GRCh38-based graph, all unplaced and random contigs were grouped together into the same component.

(4) Cactus base alignment: Cactus is a tool that uses a graph-based approach to combine sets of pairwise alignments obtained from lastz into a multiple genome alignment37. When aligning different species, it uses a phylogenetic tree to progressively decompose the alignment into a subproblem for each ancestral node in the tree. We adapted it to also accept chromosome-scale sequence-to-Minigraph mappings as produced above, and improved its runtime on alignments of many sequences by replacing its base aligner with abPOA91. The core algorithm described in ref. 37 remains unchanged, whereby the pairwise alignments were used to induce a sequence graph, then filtered using the Cactus alignment filtering algorithm, and components of unaligned sequence were then processed by the base alignment and refinement algorithm. The resulting graph was used to infer an ancestral sequence (not explicitly used in this work) and then exported to a hierarchical alignment (HAL) file92. We implemented a converter, hal2vg93, which converts the HAL alignment into a sequence graph in VG format.

(5) Post-processing and whole-genome indexing: the following post-processing steps were performed on each chromosome graph. First, unaligned sequences >10 kb in length, including sequences not aligned to Minigraph, were removed to filter out any under-alignment artefacts that might later be mistaken for insertions. Next, GFAffix94 was used to normalize the graphs by merging together redundant node prefixes and suffixes. Nodes were flipped as necessary to ensure that reference paths always visit their forward orientations. The chromosomes were combined into a whole-genome graph, indexed and exported to VCF, all using vg. Patched versions of both the GRCh38-based and CHM13-based graphs were created when it was discovered that short contigs split-mapping to distant locations had induced large deletions. The deletions were removed using vg clip -D 10000000 (and the pipeline has since been corrected to no longer produce them). Allele-filtered graphs, used for short-read mapping, were produced (from the patched graphs) by removing all nodes traversed by fewer than 9 haplotype paths (minimum AF = 10%) using vg clip -d 9 -m 10000. The chromosome HAL files were also combined into a whole-genome HAL file using halMergeChroms, and clipped sequences added back (to facilitate running CAT) using halUnclip.

PGGB

The PGGB uses a symmetric, all-to-all comparison of genomes to generate and refine a pangenome. We applied it to build a pangenome graph from all genome assemblies and references (both GRCh38 and CHM13). The resulting PGGB graph represents all alignment relationships between input genomes in a single graph. The PGGB graph is a lossless model of the input assemblies that represents all of them equivalently. This arrangement enables all of our pangenome assemblies to be used as reference systems, a property that we used to explore the full scope of pangenome variation. Owing to ambiguous placement of variation in all-to-all pairwise alignments, many SV hotspots, including the centromeres, are transitively collapsed into loops through a subgraph representing a single repeat copy, a feature that tends to reduce the size of variants found in repetitive sequences. In contrast to MC, PGGB does not filter rapidly evolving satellite sequences or the regions that do not reliably align. This increases its size and complexity relative to the MC graph and adds a significant amount of singleton sequence relative to the Minigraph and MC graphs. However, this property enables annotations and coordinates of all assemblies in the pangenome to be related to the graph structure and utilized in subsequent downstream analyses. We applied the PGGB model to investigate the full pangenome and integrate annotations established de novo on the diverse assemblies into a single model for analyses of pangenome diversity and of complex structurally variable loci (MHC and 8p inversion).

PGGB generates a pangenome graph in three phases. (1) Alignment: in the first phase, the wfmash aligner95 is used to generate all-vs-all alignments of input sequences. This method, wfmash, applies the mapping algorithm of MashMap2 to find homologies at a specified length and per cent identity. It then derives base-level alignments using a high-order version of the WFA algorithm (wflign), which first aligns sequences in segments of 256 bp, then patches up the base-level alignment with local application of WFA. wfmash was designed and developed specifically for the problem of building all-to-all alignments for large pangenomes. (2) Graph induction: the input FASTA sequences and PAF-format alignments produced by wfmash are converted to a graph (in GFA format) using seqwish96. This losslessly transforms the input alignments and sequences into a graph. (3) Graph normalization: we applied a normalization algorithm—smoothxg97—to simplify complex motifs that occur in STRs and other repetitive sequences, as well as to mitigate underalignment. The graph is first sorted using a path-guided stochastic gradient descent method98 that organizes the graph in one-dimension to optimize path distances and graph distances. This sort provides a way to partition the graph into smaller pieces over which we applied a multiple sequence alignment algorithm (abPOA)91. These pieces were laced back into a final graph. We iterated this process twice using different target POA lengths to remove boundary effects caused at the borders of the MSA problems. Finally, we applied GFAffix94 to remove redundant furcations from the topology of the graph.

To build the HPRC PGGB graph, we used both the CHM13 and GRCh38 references as a target and mapped all contigs against these with wfmash, requiring a full-length mapping at 90% total identity, collecting all contigs that mapped to a given chromosome. Contigs that did not map under this arrangement were then partitioned using a split-mapping approach, requiring 90% identity over 50 kb to seed the mappings, and putting the contig into the chromosome bin for which it had the best split mapping. We thus initially partitioned the data into 25 chromosome sets: one for each autosome, one for each sex chromosome and one for the mitochondrial genome.

We then applied PGGB (v.0.2.0+531f85f) to each partition to build a chromosome-specific graph. Run in parallel over 6 PowerEdge R6515 AMD EPYC 7402P 24-core nodes with 384 GB of RAM, this process required 22.49 system days, or around 3.7 days of wallclock time. To develop a robust process to build the HPRC graph, the PGGB team iterated the build 88 times. The final chromosome graphs were compacted into a single ID space using vg ids -j, then for each reference (GRCh38 and CHM13) a combined VCF file was generated from the graph with vg deconstruct (v.1.36.0/commit 375cad7).

A handful of key parameters defined the shape of the resulting graph. First, in wfmash, we required >100 kb mappings at 98% identity. We mapped each HPRC assembly contig and reference chromosome (both GRCh38 and CHM13) to all the other 89 input haplotypes. To reduce complexity, and false-positive SNPs resulting from misaligned regions, we applied a minimum match length filter (in seqwish) of 311 bp. This meant that the graph that we induced was relatively ‘underaligned’ locally, and only through normalization in smoothxg did we compress the bubble structures that are produced. For smoothxg, our first iteration attempts to generate 13,033 bp-long POA problems, whereas the second is 13,177 bp. These lengths provided a balanced trade-off between run time and variant detection accuracy.

In addition to a graph (in GFA), PGGB generates visualizations of the graph in one and two dimensions, which show both the topology (two dimensional) and path-to-graph relationship (one dimensional). A code-level description of the build process is provided at GitHub (github.com/pangenome/HPRCyear1v2genbank).

Pangenome graph assessment

Annotating variant sites in pangenome graphs

Variant sites in Minigraph and in MC and PGGB graphs were discovered using gfatools bubble (v.0.5)75 and vg deconstruct99, respectively. Large (>10 Mb) spurious deletions in MC and PGGB graphs were removed using vcfbub (v.0.1.0)100 with options -l 0 -r 10000000. Next, variant sites were classified into small variant (<50 bp) and SV (≥50 bp) sites. The SV sites were then annotated as described in the methods section of the article that describes Minigraph34. In brief, the longest allele sequence of each SV site was extracted and stored in the FASTA format. The interspersed repeats, low-complexity regions, exact tandem repeats, centromeric satellites and gaps in the longest allele sequences were then identified using RepeatMasker (v.4.1.2-p1) with the NCBI/RMBLAST (v.2.10.0) search engine and Dfam (v.3.3) database, SDUST (v.0.1)101, ETRF (commit fc059d5)102, dna-brnn (v.0.1)90 and seqtk gap (v.1.3)103, respectively. SDs were identified if the total node length in a site was ≥1,000 bp and ≥20% of bases of these nodes were annotated as SD in the reference or in an individual assembly (‘SD annotation’ subsection). To find hits to the GRCh38 reference genome, minimap2 (v.2.24) with options -cxasm20 -r2k --cs was used to align the longest allele sequences to the reference genome. Based on the identified features, SV sites were classified into various repeat classes using mgutils.js anno (github.com/lh3/minigraph/blob/master/misc/mgutils.js) with minor modifications to enable it to work with the files derived from the MC and PGGB graphs.

Pangenome size and growth

We used the heaps tool of the odgi pangenome analysis toolkit98 to estimate how the euchromatic autosomal pangenome grows with each additional genome assembly added. Here we approximated euchromatic regions by non-satellite DNA, which was identified by dna-brnn in the construction of the MC graph (see the ‘MC’ subsection). Although the non-reference haplotypes of the MC graph do not contain satellite DNA, the PGGB graph does. Consequently, we subset the PGGB graph to segments contained in the MC graph. We additionally excluded reference haplotypes (GRCh38 and CHM13) from the analysis. We then sampled permutations of the 88 non-reference (neither GRCh38 nor T2T-CHM13) haplotypes. In each permutation, we calculated the size of the pangenome after adding the first 1, 2, …, N haplotypes in both graphs. This produced a collection of saturation curves from which we derived a median saturation curve onto which we fitted a power law function known as Heaps’ Law. The exponent of this function is generally understood to represent the degree of openness (or diversity) of a pangenome39. In summary, we called odgi heaps -i <graph.gfa> -S -n200 to generate pangenome saturation curves for 200 permutations. In addition to calculating a non-permuted cumulative base count, we also counted the number of common (≥5% of all non-reference haplotypes) and core (≥95% of all non-reference haplotypes) bases in the pangenome graphs. To this end, we used a tool called panacus104 and supplied a list of the samples in which they are grouped according to their assigned superpopulation (pangenome-growth -m -t bp <graph.gfa> <sample order>). We repeated the count, this time including only segments of depth ≥2, that is, contained at least twice in any haplotype sequence.

Decomposing pangenome graphs based on allele traversals

Pangenome graphs were decomposed topologically into a set of nested subgraphs, termed snarls, that each correspond to one or a collection of genetic variants. These snarls were then converted to VCF format using vg deconstruct99. Large (>100 kb) deletions in MC and PGGB graphs were removed using vcfbub (v.0.1.0)100 with options -r 100000. To ease the comparison of variants with other call sets for each individual, the multi-sample VCF files were converted to per-sample VCF files using bcftools view -a -I -s <sample name>, and the multiallelic sites were split into biallelic records using bcftools norm -m -any. Owing to the limitations of snarl decomposition, snarls may contain multiple variants that cannot be further decomposed into nested snarls using vg deconstruct. If snarls of this kind are compared with truth calls, the evaluation will not be accurate. We solved this problem by comparing reference and alternate allele traversals for each snarl to infer the minimalist representation of variants (Supplementary Fig. 44).

Annotating small variants in pangenome graphs with AFs from gnomAD

Variant sites in MC and PGGB graphs were discovered using vg deconstruct99. The resulting VCF files were then decomposed on the basis of allele traversals (‘Decomposing pangenome graphs based on allele traversals’ subsection). The multi-nucleotide polymorphisms and complex indels were further decomposed into SNPs and simple indels using vcfdecompose --break-mnps --break-indels from RTG tools (v.3.12.1)83, so that they could be annotated with gnomAD later. For comparison, variants called from PacBio HiFi reads using DeepVariant and from haplotype-resolved assemblies using Dipcall were also used. For each discovery method, small variants (<50 bp) were extracted and normalized using bcftools norm -c s -f <reference sequence in FASTA format> -m -any. Next, all per-sample VCF files were combined into one VCF file using bcftools concat -a -D after dropping individual genotype information using bcftools view -G. To annotate small variants with AFs from gnomAD105, the gnomAD (v.3.1.2) per-chromosome VCF files were downloaded and concatenated into one VCF file using bcftools concat. The VCF file was then compressed into a file in the gnotate format using make-gnotate from slivar (v.0.2.7)106 with options --field AC:gnomad_ac --field AN:gnomad_an --field AF:gnomad_af --field nhomalt:gnomad_nhomalt. The small variants were annotated with gnomAD using slivar expr --gnotate <gnotate file>.

Variant benchmarking

Calling variants from PacBio HiFi reads

The PacBio HiFi reads were aligned to the GRCh38 human reference genome with no alternatives using Winnowmap2 (v.2.03)107 with -x map-pb -a -Y -L --eqx --cs. The MD tags required by Sniffles were calculated using samtools calmd. The resulting BAM files were sorted and indexed using SAMtools.

For small variants, the two-pass mode of DeepVariant (v.1.1.0)107 with WhatsHap (v.1.1)108 was used to call SNPs and indels from the PacBio HiFi read alignments. The resulting VCF files were used as truth sets for small variant benchmarking.

Three discovery methods were used to call SVs from the PacBio HiFi read alignments. For PBSV (v.2.6.2)41, SV signatures were identified using pbsv discover with --tandem-repeats <GRCh38 TRF BED file> to improve the calling performance in repetitive regions. SVs were then detected using pbsv call with --ccs --preserve-non-acgt -t DEL,INS,INV,DUP,BND -m 40 from the signatures. For SVIM (v.2.0.0)44, SVs were called using svim alignment with --read_names --zmws --interspersed_duplications_as_insertions --cluster_max_distance 0.5 --minimum_depth 4 --min_sv_size 40. In contrast to PBSV and Sniffles, SVIM outputs all calls no matter their quality. To determine the threshold used for filtering low-quality calls, a precision–recall curve was generated across various quality scores by comparing with the GIAB (v.0.6) Tier 1 SV benchmark set for HG002 (Supplementary Fig. 45). Consequently, SVIM calls with a quality score lower than ten were excluded. For Sniffles (v.1.0.12b)42, SVs were discovered with -s 4 -l 40 -n -1 --cluster --ccs_reads. Unlike PBSV and SVIM, Sniffles does not generate consensus sequences of insertions from aggregating multiple supporting reads. Therefore, Iris (v.1.0.4)43 was used to refine the breakpoints and insertion sequences with --hifi --also_deletions --rerunracon --keep_long_variants. All resulting VCF files were sorted and indexed using BCFtools.

Calling SVs from haplotype-resolved assemblies

Three discovery methods were used to call SVs from the haplotype-resolved assemblies generated using Trio-Hifiasm.

For SVIM-asm (v.1.0.2)45, assemblies were aligned to the GRCh38 human reference genome with no alternatives using minimap2 (v.2.21)86 with -x asm5 -a --eqx --cs and then sorted and indexed using SAMtools. SVs were called using svim-asm diploid with --query_names --interspersed_duplications_as_insertions --min_sv_size 40. The resulting VCF files were sorted and indexed using BCFtools.

For PAV (v.0.9.1)5, assemblies were aligned to the GRCh38 human reference genome with no alternatives using minimap2 (v.2.21)86 with options -x asm20 -m 10000 -z 10000,50 -r 50000 --end-bonus=100 --secondary=no -a --eqx -Y -O 5,56 -E 4,1 -B 5. These alignments were then trimmed to reduce the redundancy of records and to increase the contiguity of alignments. SVs, indels and SNPs were called by parsing the cigar strings of the trimmed alignments. Inversion calling in PAV uses a new k-mer density assessment to resolve inner and outer breakpoints of flanking repeats, which does not rely on alignment breaks to identify inversion sites. This is designed to overcome limitations in alignment methodologies and to expand inversion calls, which result in duplications and deletions of sequence on the boundaries.

The Hall-lab pipeline is as documented in the WDL workflow (github.com/hall-lab/competitive-alignment/blob/master/call_assembly_variants.wdl) (commit 830260a). In brief, the maternal and paternal assemblies were aligned to the GRCh38 human reference genome using minimap2 (v.2.1)86 with options -ax asm5 -L --cs. Large indels (>50 bp) were detected using the ‘call_small_variants’ task, based on paftools (v.2.17-r949-dirty). For large SVs, breakpoints were mapped based on split alignments of assembly contigs to the reference genome and classified as SVs using a series of custom Python scripts in the ‘call_sv’ task. The breakpoint-mapped SVs were then filtered on the basis of the coverage of the reference genome by the assembly contigs (calculated using bedtools genomecov (v.2.28.0)). For each haplotype assembly, a BED file of ‘excluded regions’ was defined comprising genomic regions covered by more than one distinct contig or with more than 3× coverage by a single contig. Breakpoint-mapped SVs where either breakpoint or >50% of the outer span intersected an excluded region were filtered.

Merging SV call sets

To integrate per-sample VCF files generated by three HiFi-based and three assembly-based SV callers, svtools109 was used. For each individual, VCF files from the six callers were jointly sorted and then merged using svtools lsort and lmerge, first using a strict criterion (svtools lmerge -f 20), followed by a more lenient second merge (svtools lmerge -f 100 -w carrier_wt). The autosomal SV calls supported by at least two callers were included in the consensus SV call set for comparison.
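As a hedged illustration of this two-pass merge for one individual, the commands below sketch one plausible arrangement; the input VCF names and the serial application of the strict and lenient merges are assumptions based on the description above rather than the exact pipeline used:

# sort the six per-sample call sets together, then merge twice (strict, then lenient)
svtools lsort pbsv.vcf svim.vcf sniffles.vcf svim-asm.vcf pav.vcf hall-lab.vcf > sorted.vcf
svtools lmerge -i sorted.vcf -f 20 > merged.strict.vcf
svtools lmerge -i merged.strict.vcf -f 100 -w carrier_wt > merged.vcf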

Defining confident regions for variant benchmarking

For SVs, confident regions were generated using Dipcall. Although useful for small variants, current benchmarking tools such as hap.py/vcfeval cannot properly compare different representations of small variants in and around SVs. Therefore, for each sample, the confident regions from Dipcall were further processed as follows:

(1) Exclude any SD, self-chain, tandem repeat longer than 10 kb or satellite DNA if there are any breaks in the Dipcall BED file in the repeat region +15 kb flanking sequence on each side. The rationale is that breaks in the Dipcall BED file are generally caused by missing sequence or errors in the assembly or reference or by large SVs or CNVs for which we do not have tools to benchmark small variants in these regions.

(2) Exclude 15 kb around all breaks in the Dipcall BED file for the same reason as noted above.

(3) Exclude 15 kb around all gaps in GRCh38 because alignments are unreliable.

(4) Exclude variants >49 bp in the Dipcall VCF file and any tandem repeats overlapping these SVs +50 bp on each side. A command sketch of these exclusions is shown after this list.
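The following bedtools sketch illustrates how the refined confident regions could be derived from the exclusions above; the BED file names are placeholders, and it assumes that the exclusion intervals (with their flanking padding already applied) have been prepared separately:

# combine all exclusion intervals, merge overlaps, and subtract them from the Dipcall confident regions
cat sd_selfchain_tr10kb_satellite_near_breaks.bed dipcall_breaks.pad15kb.bed grch38_gaps.pad15kb.bed dipcall_svs_gt49bp_plus_overlapping_trs.pad50bp.bed | sort -k1,1 -k2,2n | bedtools merge -i stdin > excluded_regions.bed
bedtools subtract -a dipcall_confident.bed -b excluded_regions.bed > confident_regions.refined.bed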

Benchmarking variants

Variant sites in MC and PGGB graphs were discovered using vg deconstruct99. Variant sites with alleles larger than 100 kb in MC and PGGB graphs were then removed using vcfbub (v.0.1.0)100 with options -l 0 -a 100000. The resulting VCF files were further processed using vcfwave from vcflib110 with option -I 1000. In brief, vcfwave realigned alternate alleles against the reference allele for each variant site using the bidirectional wavefront alignment (BiWFA) algorithm111 to decompose complex alleles into primitive ones. The multi-sample VCF files were then converted to per-sample VCF files using bcftools view -a -I -s <sample name> and the multiallelic sites were split into biallelic records using bcftools norm -m -any. Next, the autosomal small variants (<50 bp) from a given pangenome graph (query set) were compared with the HiFi-DeepVariant call set (truth set) using vcfeval from RTG tools (v.3.12.1)83 with options -m annotate --all-records --ref-overlap --no-roc. Note that the multi-nucleotide polymorphisms and complex indels were reduced to SNPs and simple indels using vcfdecompose --break-mnps --break-indels from RTG tools (v.3.12.1)83. The comparison was performed independently for each individual. Recall and precision were calculated within the refined Dipcall confident regions (‘Defining confident regions for variant benchmarking’ subsection) and then stratified using the GIAB (v.3.0) genomic context. To evaluate the SV (≥50 bp) calling performance, the autosomal SVs from a given pangenome graph (query set) were compared to the consensus SV call set (truth set) for each individual using truvari bench (v.3.2.0)112 with options --multimatch -r 1000 -C 1000 -O 0.0 -p 0.0 -P 0.3 -s 50 -S 15 --sizemax 100000 --includebed <Dipcall confident regions>. Recall and precision were then stratified using the GIAB (v.3.0) genomic context and by variant length.
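Put together, the small-variant comparison for a single sample could look roughly like the following; this is a hedged sketch in which the file names, the RTG SDF template path and the bgzip/tabix preparation steps are assumptions rather than the exact commands used:

# extract sites, realign alternate alleles and produce a per-sample biallelic query VCF
vcfbub -l 0 -a 100000 --input <graph>.vcf.gz > <graph>.filtered.vcf
vcfwave -I 1000 <graph>.filtered.vcf > <graph>.wave.vcf
bcftools view -a -I -s <sample> <graph>.wave.vcf | bcftools norm -m -any -Oz -o <sample>.query.vcf.gz
tabix -p vcf <sample>.query.vcf.gz
# compare against the HiFi-DeepVariant truth set within the refined confident regions
rtg vcfeval -t <grch38_SDF> -b <sample>.hifi_deepvariant.vcf.gz -c <sample>.query.vcf.gz -e <sample>.dipcall_confident.refined.bed -m annotate --all-records --ref-overlap --no-roc -o <sample>_vcfeval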

Alignment of long reads to pangenomes

PacBio HiFi reads

PacBio HiFi reads from 44 HPRC samples (excluding the held out samples) were aligned to the MC graph using GraphAligner (v.1.0.13)113 with option -x vg and stored in the GAF format34. For each read that aligned to multiple places in the graph, the alignment with the highest score was retained. To remove low-quality alignments, a read with <80% of read length aligned to the graph was discarded. After filtering the read-to-graph alignments, the read depth of each edge was calculated using vg pack (v.1.33.0)114 with options -Q -1 -D. Note that the resulting GAF files did not contain a mapping quality (encoded as 255 for missing) for each alignment, therefore the option -Q -1 was given to vg pack to ensure that these alignments were used during read-depth calculation. Next, the edges of each sample were classified into either on-target or off-target depending on whether they were on the sample paths (encoded as W-lines in MC GFA files) or not.

Oxford Nanopore reads

ONT reads obtained from 29 HPRC samples (samples labelled HPRC in Supplementary Table 1) were aligned against the MC graph. The alignments were produced using GraphAligner (v.1.0.13) with parameter settings -x vg --multimap-score-fraction 1 --multiseed-DP 1. The number of reads in these datasets ranged between 1 million and 5.4 million, with an average read length of 28.4 kb. On average, 99.68% of the reads received hits from one or more locations in the graph. For each read, we determined its best hit based on alignment score and discarded all its lower-scoring alignments in subsequent analyses. The alignment identities of these best hits peaked above 95%, with an average ratio of alignment-length-to-read-length (ALRL) of 0.880 (s.d. = 0.302) and average MAPQ value of 59.35. The alignment set was further quality-pruned by discarding alignments that either had an ALRL lower than 0.8 or a MAPQ value lower than 50. The surviving alignments had an overall average ALRL of 0.968 (s.d. = 0.047) and yielded an overall genome coverage between 10.5-fold and 43-fold across the 29 samples (Supplementary Fig. 46).
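The quality pruning described above could be approximated with a simple filter over the best-hit alignments in GAF format; this is a hedged sketch, and the file names and the use of GAF columns 2 (read length), 3 and 4 (read start and end) and 12 (MAPQ) to approximate the ALRL are assumptions:

# keep alignments with ALRL >= 0.8 and MAPQ >= 50
awk '($4 - $3) / $2 >= 0.8 && $12 >= 50' <sample>.ont.best_hits.gaf > <sample>.ont.filtered.gaf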

Annotating genes within pangenomes

We ran CAT46 to annotate each of the genomes within a pangenome graph. CAT projects a reference annotation, in this case GENCODE (v.38), to each of the haplotypes using the underlying alignments within the graph. CAT (commit eb2fc87) was run on both the GRCh38-based and the CHM13-based MC graphs. For each graph, the autosomes were first run all together, and then the sex chromosomes were run on the appropriate haplotypes. Default parameters were used, except as shown below. An example CAT command is:

luigi --module cat RunCat --hal=CHM13-f1g-90-mc-aug11.hal --ref-genome=GRCh38 --workers=8 --config=cat-hprc.gencode38.autosomes.config --work-dir work-hprc-gencode38-chm13 --out-dir out-hprc-gencode38-chm13 --local-scheduler --assembly-hub --maxCores 8 --binary-mode local > cat.hprc.gencode38.autosomes.chm13.log

Comparisons were made between the resulting CAT annotations and those from the Ensembl pipeline by looking at the parent GENCODE identifiers for each gene and transcript in the sets. Numbers of shared and unique identifiers between the sets were tabulated. Because the two annotation sets used slightly different versions of GENCODE (v.38), only those identifiers attempted to be mapped by both pipelines were considered. Additionally, features were considered to be at the same locus if their genomic intervals overlapped.

Identifying medically relevant sites

SV sites in Minigraph were discovered using gfatools bubble. To obtain the number of observed alleles per site, per-sample alleles were called using minigraph -cxasm --call (Supplementary Table 21). SV sites with alleles larger than 10 kb and at least five observed alleles were selected as complex SV sites. The complex SV sites were further filtered on the basis of whether they overlapped with medically relevant protein-coding genes47 using bedtools intersect. To assess whether the medically relevant complex SV sites were known from previous studies, the coverage of SVs from the 1KG call sets10,19 was computed using bedtools coverage -F 0.1 (Supplementary Table 16). All complex SVs were examined using Bandage115 and visually compared to previous short-read SV call sets10,19,68,116 using IGV.
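For illustration, the overlap and coverage checks could be run as follows; this is a hedged sketch and the BED file names are placeholders:

# keep complex SV sites that overlap medically relevant protein-coding genes
bedtools intersect -u -a complex_sv_sites.bed -b medically_relevant_genes.bed > complex_sv_sites.medical.bed
# compute how much of each site is covered by SVs from the 1KG call sets (each 1KG call must be overlapped by at least 10% of its length)
bedtools coverage -F 0.1 -a complex_sv_sites.medical.bed -b 1kg_sv_calls.bed > complex_sv_sites.medical.1kg_coverage.txt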

Analysis of five complex loci

Visualization of graph structures of five loci

We extracted subgraphs and paths for five loci in the MC and PGGB graphs using gfabase (v.0.6.0)117 and odgi (v.0.6.2)98 with the following example commands:

gfabase sub GRCh38-f1g-90-mc-aug11.gfab GRCh38.chr1:25240000-25460000 --range --connected --view --cutpoints 1 --guess-ranges -o RH_locus.walk.gfa
odgi extract -i chr1.pan.smooth.og -o chr1.pan.RH_locus.og -b chr1.RH_locus.bed -E -P

We then visualized the graph structures of the subgraphs using Bandage (v.0.8.1)115.

Alignment of genes to graphs

We aligned Ensembl (release 106)77 GRCh38 version gene sequences to the MC graph and PGGB graph using GraphAligner (v.1.0.13)113 with parameter settings -x vg --try-all-seeds --multimap-score-fraction 0.1 to identify the gene positions within the graphs. To show locations of genes on Bandage plots, we applied colour gradients from green to blue to the nodes of each gene. Lines alongside the Bandage plots showing approximate gene positions, exons and transcription start sites based on Ensembl Canonical transcripts were drawn by hand.

Structural haplotypes identification

Sequences of each assembly are represented by paths in a GFA file. We identified SVs in each assembly by tracing these paths through different ‘big’ bubbles (>5,000 bp) in either the MC graph or PGGB graph within those gene regions. We selected the 5,000 bp bubble size based on manual inspection of Bandage plots. An example command to identify big bubbles at a RH locus is as follows:

bcftools filter hprc-v1.0-mc-grch38.vcf.gz -r chr1:25240000-25460000 | grep LV=0 | awk '{OFS="\t"; print $1,$2,$3,$4,$5}' | tr "," "\t" | awk '{OFS="\t"; for (i=1;i<=NF;i++) {len=length($i); if (len>5000) {print $1,$2,$3; next}}}'

To identify gene conversion events (as gene conversions are not shown as bubbles in the graphs), we identified nodes that were different between a gene and its homologous gene (for example, RHD and RHCE) based on the GraphAligner alignments described above. We refer to these as paralogous sequence variants. A gene conversion event was detected if a path of a gene goes through more than four paralogous sequence variants of its homologous gene in a row.

Visualization of linear gene structures

We counted the number of assemblies for each structural haplotype and computed their frequency. We visualized linear haplotype structures (for example, in Fig. 5c) using gggenes (v.0.4.1)118 based on the structural haplotypes determined for each assembly from the pangenome graphs. The length of intervals between genes is fixed (except for TMEM50A and RHCE, because those two genes are immediately next to each other). Lengths of genes are shown as proportional to gene lengths in GRCh38.

Pangenome point genotyping

Alignment of reads to the pangenome

The short reads were first split into chunks to parallelize the read mapping to the ‘allele-filtered graph’ pangenome, as defined in the ‘MC’ subsection. This pangenome is included within the dataset accompanying this paper and can be identified as ‘clip.d9.m1000.D10M.m1000’. Mapping was performed with Giraffe49 from vg release v.1.37.0. For trio-based runs, the trio-sample sets of short reads were mapped to the pangenome using Giraffe from vg release v.1.38.0. Note that the core vg algorithms for Giraffe mapping and surjection (conversion from graph space to linear space) are the same in both vg v.1.37.0 and v.1.38.0. The output alignments, surjected to GRCh38 in BAM format as explained below, are available at s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=publications/PANGENOME_2022/DeepTrio/samples in the bam directory of each sample’s directory, and are organized by aligner.

Surjection to GRCh38 and indel realignment

To perform variant calling, GAM alignments were surjected onto the chromosomal paths from GRCh38 (chromosomes 1–22, X and Y) using vg surject and the --prune-low-cplx option to prune short and low-complexity anchors during realignment. The BAM files were sorted and split by chromosome using SAMtools (v.1.3.1)119. The reads were realigned, first using bamleftalign from FreeBayes (v.1.2.0)120, and then with ABRA (v.2.23)121 on target regions that were identified using RealignerTargetCreator from GATK (v.3.8.1)122 and expanded by 160 nucleotides with bedtools slop (v.2.21.0)123.
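A hedged sketch of this chain for one sample and one chromosome is shown below; the file names, the conversion of GATK intervals to BED and the specific tool invocations beyond the options quoted above are assumptions rather than the exact commands used:

# surject graph alignments onto the GRCh38 paths, then sort, index and split by chromosome
vg surject -x <graph>.xg -b --prune-low-cplx <sample>.gam > <sample>.grch38.bam
samtools sort -o <sample>.sorted.bam <sample>.grch38.bam && samtools index <sample>.sorted.bam
samtools view -b <sample>.sorted.bam chr1 > <sample>.chr1.bam
# left-align indels, then realign around candidate indel targets expanded by 160 bp on each side
bamleftalign -f <grch38>.fa < <sample>.chr1.bam > <sample>.chr1.leftaligned.bam
samtools index <sample>.chr1.leftaligned.bam
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R <grch38>.fa -I <sample>.chr1.leftaligned.bam -o <sample>.chr1.intervals
awk -F'[:-]' 'NF==3 {print $1"\t"$2-1"\t"$3} NF==2 {print $1"\t"$2-1"\t"$2}' <sample>.chr1.intervals | bedtools slop -i stdin -g <grch38>.fa.fai -b 160 > <sample>.chr1.targets.bed
java -jar abra2.jar --in <sample>.chr1.leftaligned.bam --out <sample>.chr1.realigned.bam --ref <grch38>.fa --targets <sample>.chr1.targets.bed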

Model training

To perform variant calling with DeepVariant and DeepTrio, we trained machine-learning models specific to our graph reference and Giraffe alignment pipeline based on our alignments. For all models, chromosome 20 was entirely held out from all input samples to provide a control.

Training was performed on Google’s internal cluster, using unreleased Google tensor processing unit (TPU) accelerators, from a cold start (that is, without using a pre-trained model as input). We believe that nothing about the way in which we executed the training is essential to the results obtained. Cold start training is estimated to be feasible outside the Google environment; therefore the claims we present here are falsifiable, but it is not expected to be cost-effective. Researchers looking to independently replicate our training should consider doing warm start training from a base model trained on other data, using commercially available graphics processing unit (GPU) accelerators. An example procedure can be found in the DeepVariant training tutorial at GitHub (github.com/google/deepvariant/blob/r1.3/docs/deepvariant-training-case-study.md). We predict that this more accessible method would produce equivalent results.

For both DeepVariant and DeepTrio, the true variant calls being trained against came from the GIAB benchmark (v.4.2.1).

For DeepVariant, we trained on the HG002, HG004, HG005, HG006 and HG007 samples, with HG003 held out. The trained DeepVariant model is available at s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=publications/PANGENOME_2022/DeepVariant/models/DEEPVARIANT_MC_Y1.

For DeepTrio, we trained two sets of models: one on HG002, HG003, HG004, HG005, HG006 and HG007, with HG001 held out; and one on HG001, HG005, HG006 and HG007, with the HG002, HG003 and HG004 trio held out. Each DeepTrio model set included parental and child models. The two trained child deeptrio models are available at s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=publications/PANGENOME_2022/DeepTrio/models/deeptrio/child and s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=publications/PANGENOME_2022/DeepTrio/models/deeptrio-no-HG002-HG003-HG004/child, respectively. The two trained parental DeepTrio models are available at s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=publications/PANGENOME_2022/DeepTrio/models/deeptrio/parent and s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=publications/PANGENOME_2022/DeepTrio/models/deeptrio-no-HG002-HG003-HG004/parent, respectively.

Variant calling with DeepVariant

DeepVariant (v.1.3) was evaluated on HG003, using the model we trained with HG003 held out (see ‘Model training’). We used the --keep_legacy_allele_counter_behavior flag (introduced to support this analysis) and a minimum mapping quality of 1 in the make_examples step, before calling the variants with call_variants. Both VCFs and gVCFs were produced. The WDL workflow used for single sample mapping and variant calling was deposited into Zenodo (doi.org/10.5281/zenodo.6655968).

Variant calling on GRCh38 with BWA-MEM and DeepVariant

Small variants were also called using a more traditional pipeline. We aligned reads with BWA-MEM50 to GRCh38 with decoys but no ALTs. DeepVariant then called small variants from the aligned reads. The same version and parameters were used for DeepVariant. Only the model was changed to the default DeepVariant model.

Variant calling with DeepTrio

Small variants were also called using DeepTrio (v.1.3). For HG001, we used the DeepTrio models we trained with HG001 held out (see ‘Model training’). For the HG002, HG003 and HG004 trio and HG005, HG006 and HG007 trio, we used the models trained with the HG002, HG003 and HG004 trio held out; the HG005, HG006 and HG007 trio (except for chromosome 20) was still included in the training set. We used the --keep_legacy_allele_counter_behavior flag and a minimum mapping quality of 1 in the make_examples step before calling the variants with call_variants. Both VCFs and gVCFs were produced and are available at s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=publications/PANGENOME_2022/DeepTrio/samples in the vcf directory of each sample’s directory, and are organized by mapping and calling condition. The WDL workflow used for trio-based mapping and variant calling was deposited into Zenodo (doi.org/10.5281/zenodo.6655962).

Variant calling on GRCh38 with BWA-MEM, Dragen Graph and DeepTrio

For DeepTrio, small variants were also called using a more traditional pipeline and a graph-based implementation of Illumina’s Dragen platform (v.3.7.5). The conditions evaluated were each a combination of a mapper and a reference. The Giraffe-HPRC condition used Giraffe (v.1.38.0)49 to align reads to the HPRC reference. The BWA-MEM condition used BWA-MEM (v.0.7.17-r1188)50 to align reads to the hs38d1 human reference genome with decoys but no ALTs. The Dragen-DeepTrioCall condition used Illumina’s Dragen platform (v.3.7.5)51 against their default graph, which was constructed using the same GRCh38 reference with decoys but no ALTs, and population contigs, SNPs and liftover sequences from datasets internal to their platform. DeepTrio then called small variants from the aligned reads. The same version and parameters were used for DeepTrio (v.1.3). Only the default model was used for these conditions. We also applied the native Dragen caller and joint genotyper to the Dragen-Graph-based alignments for comparison purposes, referred to as Dragen-DragenCall and Dragen-DragenJointCall, respectively. Dragen-DragenCall implements a single-sample-based method and is the default use case for processing Dragen-Graph-mapped data. Dragen-DragenJointCall uses a pedigree-backed implementation that informs which variants are likely to be de novo and which are erroneous given the genotype information of the parents. To make a fairer comparison with Dragen, we tested these configurations to assess what implementation of Dragen variant calling produced the best results given the available trio data.

Evaluation using the GIAB benchmark

The small variant calls were evaluated using HG001–HG007 with the GIAB benchmark (v.4.2.1)124, on HG002 in challenging medically relevant autosomal genes47, and on HG002 using a preliminary draft assembly-based benchmark. For the draft assembly-based benchmark, we used Dipcall38 to align a scaffolded, high-coverage Trio-Hifiasm assembly21,38 to GRCh38 and call variants, and then we excluded structurally variant regions from the dip.bed file as described above for the benchmarking of small variants from the pangenome graph. The comparison between the call sets and truth set was made with RTG’s vcfeval83 and Illumina’s hap.py tool82 on confident regions of the benchmark. We used high-coverage read sets of the GIAB HG001, HG002 and HG005 trio child samples and evaluated performance within the held-out chromosome 20 for the GIAB (v.4.2.1) truth set, or the entire genome for the reduced truth set of the challenging medically relevant autosomal genes. The evaluation was also stratified using the set of regions provided by the GIAB at GitHub (github.com/genome-in-a-bottle/genome-stratifications)125.
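As an illustration of such a comparison, a hap.py invocation restricted to the held-out chromosome 20 might look like the following; this is a hedged sketch and the file names, stratification TSV and output prefix are placeholders rather than the exact commands used:

# compare calls against the GIAB v4.2.1 truth set within its confident regions, restricted to chromosome 20
hap.py <HG002_GIAB_v4.2.1_truth>.vcf.gz <HG002_calls>.vcf.gz -f <HG002_GIAB_v4.2.1_confident>.bed -r <grch38>.fa --stratification <GIAB_stratifications>.tsv -l chr20 --engine vcfeval -o <output_prefix>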

Variant calls across samples from the 1KG

We applied our small variant calling pipeline to the high-coverage read sets for the 3,202 samples of the 1KG19. The output alignments, in the GAM format, and the VCFs were saved in public buckets at console.cloud.google.com/storage/browser/brain-genomics-public/research/cohort/1KGP/vg/graph_to_grch38. We selected 100 trios among those samples to further evaluate the quality of the calls. We tested all variants that had at least one alternative allele in a trio for Mendelian consistency. In addition, for each variant, we only considered trios for which the child’s genotype was different from the genotype of at least one of the parents to minimize bias created by systematic calls (for example, all homozygous or all heterozygous). We looked at the fraction of variant–trio combinations that failed Mendelian consistency in the entire genome and in sites that do not overlap simple repeats as defined by the ‘simpleRepeat’ track downloaded from the UCSC Genome Browser. The results were compared with Mendelian consistency of calls provided by the 1KG that used GATK HaplotypeCaller on the reads aligned to GRCh38. We also repeated this analysis on the two trios of the GIAB (v.4.2.1) benchmark (HG002–HG007) and across the different methods of our evaluation described above (BWA-MEM and DragenGraph mappers; DeepVariant, DeepTrio and Dragen variant callers).

SV genotyping with PanGenie

VCF preprocessing

We used a VCF file created on the basis of snarl traversal of the MC graph as a basis for genotyping. The records contained in this VCF represent bubbles in the underlying pangenome graph and their nested variants, derived from the snarl tree. Each variant was marked according to its level in this tree. Variants annotated by ‘LV=0’ correspond to the top-level bubbles. We used vcfbub (v.0.1.0)100 with parameters -l 0 and -r 100000 to filter the VCF. This removed all non-top-level bubbles from the VCF unless they were nested inside a top-level bubble with a reference length exceeding 100 kb; that is, top-level bubbles longer than that are replaced by their child nodes in the snarl tree. The VCF also contained the haplotypes for all 44 assembly samples, representing paths in the pangenome graph. We additionally removed all records for which more than 20% of all 88 haplotypes carried a missing allele (“.”). This resulted in a set of 22,133,782 bubbles. Next, we used PanGenie (v.1.0.0)54 to genotype these bubbles across all 3,202 samples from the 1KG based on high-coverage Illumina reads19.
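The filtering described above could be sketched as follows; this is a hedged example in which the file names are placeholders and the bcftools F_MISSING expression only approximates the haplotype-level missingness filter:

# keep top-level bubbles (replacing very long ones by their children), then drop records with too much missingness
vcfbub -l 0 -r 100000 --input <mc_graph>.vcf.gz > <mc_graph>.top_level.vcf
bcftools view -e 'F_MISSING > 0.2' <mc_graph>.top_level.vcf -Oz -o <pangenie_panel>.vcf.gz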

Decomposition of variants

We genotyped all top-level bubbles across all 1KG samples. Whereas biallelic bubbles can be easily classified representing SNPs, indels or SVs, this becomes more difficult for multiallelic bubbles contained in the VCF. In particular, larger multiallelic bubbles can contain a high number of nested variant alleles overlapping across haplotypes, represented as a single bubble in the graph. This is especially problematic when comparing the genotypes computed for the entire bubble to external call sets, as coordinates of the bubble do not necessarily represent the exact coordinates of individual variant alleles carried by a sample in such a region (Supplementary Fig. 20).

To tackle this problem, we implemented a decomposition approach that aimed to detect all variant alleles nested inside multiallelic top-level records. The idea was to detect variants from the node traversals of the reference and alternative alleles of all top-level bubbles. Given the node traversals of a reference and alternative path through a bubble, our approach was to match each reference node to its leftmost occurrence in the alternative traversal, resulting in an alignment of the node traversals (Supplementary Fig. 21a). Nested alleles could then be determined based on indels and mismatches in this alignment. As the node traversals of the alternative alleles can visit the same node more than once (which is not the case for the reference alleles of the MC graph), this approach is not guaranteed to reconstruct the optimal sequence alignment underlying the nodes in these repeated regions.

As an output, the decomposition process generated two VCF files. The first one is a multiallelic VCF that contains exactly the same variant records as the input VCF, except that annotations for all alternative alleles of a record are added to the identifier tag in the INFO field. For each alternative allele, the identifier tag contains identifiers encoding all nested variants it is composed of, separated by a colon. The second VCF is biallelic and contains a separate record for each nested variant identifier defining reference and alternative allele of the respective variant (Supplementary Fig. 21b). Both VCFs are different representations of the same genomic variation, that is, before and after decomposition. We applied this decomposition method to the MC-based VCF file, used the multiallelic output VCF as input for PanGenie to genotype bubbles, and used the biallelic VCF as well as the identifiers to translate PanGenie’s genotypes for bubbles to genotypes for all individual nested variant alleles. All downstream analyses of the genotypes are based on this biallelic representation (that is, after decomposition).

Although the majority of short bubbles (<10 bp) are biallelic, particularly large bubbles (>1,000 bp) tend to be multiallelic. Sometimes each of the 88 non-reference (neither GRCh38 nor T2T-CHM13) haplotypes contained in the graph covered a different path through such a bubble (Supplementary Fig. 22a), leading to a VCF record with 88 alternative alleles listed. We determined the number of variant alleles located inside biallelic and multiallelic bubbles in the pangenome after decomposition. As expected, the majority of SV alleles was located inside of the more complex, multiallelic regions of the pangenome (Supplementary Fig. 22b).

Genotyping evaluation based on assembly samples

We conducted a leave-one-out experiment to evaluate PanGenie’s genotyping performance for the call set samples. For this purpose, we repeatedly removed one of the panel samples from the MC VCF and genotyped it using only the remaining samples as an input panel for PanGenie. We later used the genotypes of the left-out sample as ground truth for evaluation. We repeated this experiment for five of the call set samples (HG00438, HG00733, HG02717, NA20129 and HG03453) using 1KG high-coverage Illumina reads19. PanGenie is a re-genotyping method. Therefore, like every other re-typer, it can only genotype variants contained in the input panel VCF; that is, it is not able to detect variants unique to the genotyped sample. For this reason, we removed all variant alleles (after decomposition) unique to the left-out sample from the truth set used for evaluation. To evaluate the genotyping performance, we used the weighted genotype concordance54. Extended Data Fig. 9 shows the results stratified by different regions. Extended Data Fig. 9a shows concordances in biallelic and multiallelic regions of the MC VCF. The biallelic regions include only bubbles with two branches. The multiallelic regions include all bubbles in which haplotypes cover more than two different paths. Extended Data Fig. 9b shows the same results stratified by genomic regions defined by GIAB, which we obtained from the following sources: easy (ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v3.0/GRCh38/union/GRCh38_notinalldifficultregions.bed.gz); low-mappability (ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v3.0/GRCh38/union/GRCh38_alllowmapandsegdupregions.bed.gz); repeats (ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v3.0/GRCh38/LowComplexity/GRCh38_AllTandemRepeats_gt100bp_slop5.bed.gz); and other-difficult (ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v3.0/GRCh38/OtherDifficult/GRCh38_allOtherDifficultregions.bed.gz).

Here and in the following, we considered results for SNPs, indels (1–49 bp), SV deletions, SV insertions and other SV alleles, defined as follows: SV deletions include all alleles for which length(REF) ≥ 50 bp and length(ALT) = 1 bp; SV insertions include all alleles for which length(REF) = 1 bp and length(ALT) ≥ 50 bp; all other alleles with a length ≥ 50 bp are included in ‘others’.
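
For illustration, a small helper implementing these definitions on a decomposed, biallelic record (hypothetical; not the code used in the study):

def classify_allele(ref: str, alt: str) -> str:
    """Classify a variant allele by the lengths of its REF and ALT sequences."""
    if len(ref) >= 50 and len(alt) == 1:
        return "SV deletion"
    if len(ref) == 1 and len(alt) >= 50:
        return "SV insertion"
    if len(ref) >= 50 or len(alt) >= 50:
        return "SV other"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    return "indel"  # 1-49 bp

print(classify_allele("A", "A" + "T" * 60))  # SV insertion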

Overall, weighted genotype concordances were high for all variant types. In particular, variant alleles in biallelic regions of the graph were easily genotypable. Alleles inside multiallelic bubbles were more difficult to genotype correctly as PanGenie needs to decide between several possible alternative paths, whereas there are only two such paths for biallelic regions (Extended Data Fig. 9a). Furthermore, genotyping accuracy depended on the genomic context (Extended Data Fig. 9b). Regions with low mappability, repetitive regions and other difficult regions were harder to genotype than regions classified as ‘easy’ by GIAB.

Creating a high-quality subset

We generated genotypes for all 3,202 1KG samples with PanGenie and defined a high-quality subset of SV alleles that we could reliably genotype. For this purpose, we applied a machine-learning approach similar to what we have previously presented5,54. We defined positive and negative subsets of variants based on the following filters: ac0_fail, the variant allele was genotyped with an AF of 0.0 across all samples; mendel_fail, the Mendelian consistency across trios was <80% for the variant allele (here we use a strict definition of Mendelian consistency, which excludes all trios with only 0/0, only 0/1 or only 1/1 genotypes); gq_fail, <50 high-quality genotypes were reported for the variant allele; self_fail, the genotyping accuracy of the variant allele across the panel samples was <90%; nonref_fail, not a single non-0/0 genotype was genotyped correctly across all panel samples.

The positive set included all variant alleles that passed all five filters. The negative set contained all variant alleles that passed the ac0_fail filter but failed at least three of the other filters. We trained a support vector regression approach based on these two sets that used multiple features, including AFs, Mendelian consistencies and the number of alternative alleles transmitted from parents to children. We applied this method to all remaining variant alleles genotyped with an AF > 0, resulting in a score between –1 (bad) and 1 (good) for each. We finally defined a filtered set of variants that included the positive set and all variant alleles with a score of ≥ –0.5.
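
A minimal sketch of this filtering step using scikit-learn; the feature matrices below are random placeholders, whereas in the study the features (for example, AFs, Mendelian consistencies and transmitted allele counts) would be computed from the genotypes, and the exact implementation follows refs. 5,54:

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_pos = rng.random((100, 3))   # placeholder features for alleles passing all five filters
X_neg = rng.random((100, 3))   # placeholder features for the negative set
X_rest = rng.random((500, 3))  # placeholder features for remaining alleles with AF > 0

X_train = np.vstack([X_pos, X_neg])
y_train = np.concatenate([np.ones(len(X_pos)), -np.ones(len(X_neg))])

model = SVR().fit(X_train, y_train)

scores = model.predict(X_rest)          # roughly between -1 (bad) and 1 (good)
keep = np.flatnonzero(scores >= -0.5)   # filtered set = positive set plus these alleles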

We show the number of variant alleles contained in the unfiltered set, the positive set and the filtered set in Supplementary Table 18. As our focus is on SVs and as 65% of all SNPs and indels are already contained in the positive set, we applied our machine-learning approach only to SVs. We found that 50%, 33% and 26% of all deletion, insertion and other alleles, respectively, were contained in the final, filtered set of variants. Note that these numbers take all distinct SV alleles contained in the call sets into account. Especially for insertions and other SVs, many of these alleles are highly similar, with sometimes only a single base pair differing. Therefore, it is probable that many of these actually represent the same events. Our genotyping and filtering approach helped to remove such redundant alleles.

To evaluate the quality of the PanGenie genotypes, we compared the AFs observed for the SV alleles across all 2,504 unrelated 1KG samples to their AFs observed across the 44 assembly samples in the MC call set. Supplementary Figs. 23–25 show the results for SV deletions, insertions and other SV alleles. We observed that the AFs between both sets matched well, resulting in Pearson correlations of 0.93, 0.87 and 0.81 for deletions, insertions and other alleles, respectively, contained in the unfiltered set. For the filtered set, we observed correlations of 0.96, 0.93 and 0.90, respectively. We also analysed the heterozygosity of the PanGenie genotypes across all 2,504 unrelated 1KG samples and observed a relationship close to what is expected under Hardy–Weinberg equilibrium (Supplementary Figs. 23–25, lower panels).
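
The two checks can be sketched as follows; the allele frequencies below are synthetic placeholders, whereas in the analysis they come from the MC call set (44 assembly samples) and from the PanGenie genotypes of the 2,504 unrelated samples:

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
af_assembly = rng.uniform(0.01, 0.99, 1000)                            # placeholder AFs
af_pangenie = np.clip(af_assembly + rng.normal(0, 0.03, 1000), 0, 1)   # placeholder AFs

r, _ = pearsonr(af_assembly, af_pangenie)
print(f"Pearson r = {r:.2f}")

# Under Hardy-Weinberg equilibrium, the expected heterozygosity at allele frequency p
# is 2p(1 - p); the observed heterozygosity is the fraction of 0/1 genotypes per allele.
expected_het = 2 * af_pangenie * (1 - af_pangenie)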

Number of SVs per sample

We compared our filtered set of variant alleles to the HGSVC PanGenie genotypes (v.2.0 lenient set)5 and Illumina-based SV genotypes19. A direct comparison of the three call sets is difficult. The HGSVC and HPRC call sets are based on variant calls produced from haplotype-resolved assemblies of 32 and 44 samples, respectively5. For each call set, variants were re-genotyped across all 3,202 1KG samples. Note that the call set samples for HPRC and HGSVC are disjoint. As re-genotyping cannot discover new variants, both call sets will miss variants carried by the 3,202 samples that were not seen in the assembly samples. By contrast, the 1KG call set contains short-read-based variant calls produced for each of the 3,202 1KG samples. Another difference between the HGSVC and HPRC call sets is that in the HGSVC call set, highly similar alleles are merged into a single record to correct for representation differences across different samples or haplotypes. The HPRC call set, however, keeps all these alleles separate even if there is only a single base pair difference between them. To make the call sets more comparable, we merged clusters of highly similar alleles in the HPRC filtered set before comparisons with other call sets. This was done with Truvari (v.3.1.0)112 using the command: truvari collapse -r 500 -p 0.95 -P 0.95 -s 50 -S 100000.

To be able to properly compare the call sets despite their differences, we counted the number of SV alleles present in each sample (genotype 0/1 or 1/1) in each call set and plotted the corresponding distributions stratified by genome annotations from GIAB (same as above, Fig. 6d). We also generated the same plot including only common SV alleles with an AF > 5% across all 3,202 samples (Supplementary Fig. 28). Both plots showed that both assembly-based call sets (HPRC and HGSVC) were able to access more SVs across the genome than the short-read-based 1KG call set, especially deletions <300 bp and insertions (Fig. 6e). This result confirms that SV callers based on short reads alone miss a large proportion of SVs located in regions inaccessible by short-read alignments, which has been previously reported by several studies5,8. In the ‘easy’ regions, the number of SVs per sample was consistent across all three call sets. For the other regions, however, results indicated that the HPRC-filtered genotypes provided access to more variant alleles than the HGSVC lenient set, especially insertions and variants in regions of low mappability and tandem repeats (Fig. 6d,e).

To evaluate the new SVs in our filtered HPRC call set, we revisited the leave-one-out experiment we had previously performed on the unfiltered set of variants (see above). We restricted the evaluation to the following subsets of variants: (1) those that are in our filtered set but not in the 1KG Illumina calls (novel); (2) those in our filtered set and in the 1KG Illumina call set (known); and (3) all variants in our filtered set. To find matches between our set and the Illumina calls, we used a criterion based on a reciprocal overlap of at least 50%. Results are shown in Supplementary Fig. 29. We generated two versions of this figure: the first one (top) excludes variants that are unique to the left-out sample and are therefore not typable by any re-genotyping method, and the second one (bottom) includes these variants. In general, genotype concordances of all lenient variants (brown, dark purple) were slightly higher compared with the concordances we observed for the unfiltered set (Supplementary Fig. 29). Furthermore, concordances of the known variants were highest. This is expected, as these variants tended to be in regions easier to access by short reads. Concordances for novel variants were slightly worse. This was also expected, as these variants tended to be located in more complex genomic regions that are generally harder to access. However, even for these variants, concordances were still high, which indicated that the PanGenie genotypes for these variants are of high quality.
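
For illustration, one way to implement the 50% reciprocal-overlap criterion for matching two SV intervals (a hypothetical helper, not the exact matching code used):

def reciprocal_overlap(start_a, end_a, start_b, end_b, min_fraction=0.5):
    """True if the intervals overlap by at least min_fraction of each interval's length."""
    overlap = min(end_a, end_b) - max(start_a, start_b)
    if overlap <= 0:
        return False
    return (overlap / (end_a - start_a) >= min_fraction and
            overlap / (end_b - start_b) >= min_fraction)

# Example: a 100-bp deletion and a 120-bp call sharing 90 bp match reciprocally.
print(reciprocal_overlap(1000, 1100, 1010, 1130))  # True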

Evaluation based on medically relevant SVs

In addition to all 3,202 1KG samples, we genotyped sample HG002 based on Illumina reads from ref. 18. We used the GIAB CMRG benchmark containing medically relevant SVs47 downloaded from ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/AshkenazimTrio/HG002_NA24385_son/CMRG_v1.00/GRCh38/StructuralVariant/ for evaluation. Similar to the 1KG samples, we used the MC-based VCF (see above) containing variant bubbles and haplotypes of 44 assembly samples as an input panel for PanGenie. We extracted all variant alleles with a length ≥ 50 bp from our genotyped VCF (biallelic version, after decomposition). We converted the ground truth VCF into a biallelic representation using bcftools norm -m -any and kept all alleles with length ≥ 50 bp. We used Truvari (v.3.1.0)112 with parameters --multimatch --includebed <medically-relevant-sv-bed> -r 2000 --no-ref a -C 2000 --passonly to compare our genotype predictions to the medically relevant SVs. Results are shown in Supplementary Table 22. As PanGenie is a re-typing method, it can only genotype variants provided in the input and therefore cannot detect novel alleles. As HG002 is not among the panel samples, the input VCF misses variants unique to HG002. Thus, these unique variants cannot be genotyped by PanGenie and will be counted as false negatives during evaluation. Therefore, we computed an adjusted version of the recall that excluded SV alleles unique to HG002 (that is, alleles not in the graph) from the truth set used for evaluation. To identify which SV alleles were unique, we compared each of the 44 panel samples to the ground truth VCF using Truvari to identify the false negatives for each sample. We then computed the intersection of false-negative calls across all samples. The resulting set contained all variant alleles unique to the HG002 ground truth set. We found 15 such unique SV alleles among the GIAB CMRG variants. We removed these alleles from the ground truth set and recomputed precision–recall statistics for our genotypes. Adjusted precision–recall values are shown in Supplementary Table 22.
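
A schematic sketch of this adjustment, assuming the per-sample Truvari false negatives have been collected as sets of variant keys (names and key format are hypothetical):

def unique_truth_variants(false_negatives):
    """Truth variants missed by every panel sample are unique to HG002."""
    samples = iter(false_negatives.values())
    unique = set(next(samples))
    for fn in samples:
        unique &= set(fn)
    return unique

# Toy example with two panel samples; the variant missed by both is treated as unique
# to the truth set and removed before recomputing precision and recall.
fns = {
    "HG00438": {("chr1", 1000, "A", "ATT")},
    "HG00733": {("chr1", 1000, "A", "ATT"), ("chr2", 500, "C", "G")},
}
print(unique_truth_variants(fns))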

All genotyping results produced by PanGenie are available at Zenodo (doi.org/10.5281/zenodo.6797328).

Read mapping at VNTR regions

Simulating and mapping VNTR reads

Raw VNTR coordinates on GRCh38 (chromosomes 1–22 and sex chromosomes only) were generated using TRF (v.4.09)56 with command trf hg38.fa 2 7 7 80 10 50 500 -f -d. Only repeats with a period size between 6,000 and 10,000 bp, total length >100 bp and not overlapping with centromeric regions were selected, leaving a total number of 98,021 non-overlapping loci. Using the raw VNTR coordinates on GRCh38 as input, VNTR regions across 96 haplotypes (including GRCh38 and CHM13) were annotated using the build module in danbing-tk (v.1.3)55 (dist_scan=700, dist_merge=1, TRwindow=100000, MBE_th1=0.3, MBE_th2=0.6).

Whole-genome paired-end error-free short reads were simulated at around 30× for each genome, or equivalently about 15× for each haplotype. A read pair was generated for every 20 bp with a fragment size of 500 bp and a read length of 150 bp. Paired-end read mapping to the MC graph was done with Giraffe (v.1.39.0)49 using the command vg giraffe -x $pref.xg -g $pref.gg -H $pref.gbwt -m $pref.min -d $pref.dist -p -f <(zcat $h1 $h2) -i -t 16, whereas mapping to GRCh38 was done with BWA-MEM (v.0.7.17-r1188)50 using the command bwa mem -t 16 -Y -K 100000000 -p $ref <(zcat $h1 $h2). For a fair comparison, GRCh38 plus decoy minus ALT/HLA contigs were used as the reference to match the paths included in the MC graph.

Evaluating read mapping accuracy at VNTR regions

To evaluate the performance of read mapping using the MC graph plus Giraffe, the VNTR information from danbing-tk was used to annotate each node in the graph by traversing each haplotype path. Every node that covers a VNTR region has a tuple that denotes the intersected interval; any aligned reads overlapping with the interval were considered mapped to the VNTR. Similarly, a read simulated from an interval overlapping with a VNTR was considered derived from the VNTR. To evaluate the performance of GRCh38 plus BWA-MEM, the region mapped by each read was obtained using the bamtobed submodule in BEDTools (v.2.30.0)123. The VNTR annotations on GRCh38 were used to determine whether a read was mapped to a VNTR.

For each read, we tracked its source VNTR and the VNTR it was mapped to, and used this information to compute accuracy. Only VNTRs present in danbing-tk’s annotations were tracked; all other regions, including non-VNTR regions, were labelled untracked. A true positive denotes mapping from a VNTR to its original VNTR. An exogenous false positive denotes mapping from untracked regions to a VNTR. An endogenous false positive denotes mapping from a VNTR to another VNTR. A true negative denotes mapping from untracked regions to untracked regions. A false negative denotes mapping from a VNTR to untracked regions (Supplementary Fig. 30). Any alignments in the JSON output of Giraffe that did not contain the mapping field were considered unmapped. The two ends of a read pair that did not map to the same chromosome by BWA-MEM were also considered unmapped.
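
For illustration, the five categories can be expressed as a simple decision function (hypothetical; untracked covers both non-VNTR regions and VNTRs absent from the danbing-tk annotation):

def classify_read(source_vntr, mapped_vntr):
    """source_vntr / mapped_vntr are VNTR identifiers, or None if untracked."""
    if source_vntr is None and mapped_vntr is None:
        return "true negative"
    if source_vntr is None:
        return "exogenous false positive"  # untracked region -> VNTR
    if mapped_vntr is None:
        return "false negative"            # VNTR -> untracked region
    if source_vntr == mapped_vntr:
        return "true positive"             # VNTR -> its original VNTR
    return "endogenous false positive"     # VNTR -> another VNTR

print(classify_read("VNTR_17", "VNTR_17"))  # true positive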

Estimating VNTR length variants from read depths

The WGS samples for 35 genomes (HG00438, HG00621, HG00673, HG00733, HG00735, HG01071, HG01106, HG01109, HG01175, HG01243, HG01258, HG01361, HG01891, HG01928, HG01952, HG01978, HG02055, HG02080, HG02145, HG02148, HG02257, HG02572, HG02622, HG02630, HG02717, HG02723, HG02818, HG02886, HG03098, HG03453, HG03486, HG03492, HG03579, NA18906 and NA19240) were mapped to the MC graph using Giraffe as described in the ‘Pangenome point genotyping’ subsection. Using the VNTR annotations described in the previous section, the number of reads mapped to each VNTR region in the MC graph was calculated as a proxy for VNTR length. VNTRs with invariant length across the 35 genomes were removed from analysis, leaving a total of 60,861 loci. The ground truth for a VNTR in a genome was computed from the number of bases spanned by the VNTR, averaged from the two haplotypes.
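
As a sketch, the read-count proxy can be compared with the assembly-derived lengths per locus; the arrays below are synthetic placeholders, whereas in the analysis the counts come from reads mapped to each VNTR region of the MC graph and the truth lengths from the two assembled haplotypes:

import numpy as np

rng = np.random.default_rng(2)
hap1_len = rng.integers(100, 5000, 1000)
hap2_len = rng.integers(100, 5000, 1000)
truth_len = (hap1_len + hap2_len) / 2                    # ground truth: mean of the two haplotypes
read_counts = truth_len / 150 + rng.normal(0, 2, 1000)   # placeholder depth-based proxy

r = np.corrcoef(read_counts, truth_len)[0, 1]
print(f"per-locus correlation: {r:.2f}")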

As a baseline control, the read depth of each VNTR region for the 35 WGS samples produced by mapping reads to GRCh38 was also computed with mosdepth (v.0.3.1)126 using the command mosdepth -t 4 -b $VNTR_bed -x -f $hg38 $pref $cram. To be able to compare with the graph-based approach, VNTRs with missing annotation on GRCh38 were further removed, leaving a total of 60,386 VNTRs.

RNA-seq mapping evaluation

We augmented the allele-filtered graph (see the ‘Pangenome point genotyping’ subsection) with edges for splice junctions to create a spliced pangenome graph using the rna subcommand in the vg toolkit (v.1.38.0) with a maximum node length set to 32 (vg rna -k 32)57. The transcript annotations that were used to define the splice junctions consisted of the CAT transcript annotations on each assembly together with splice junctions from the GENCODE (v.38) annotation29. Transcripts from the GENCODE (v.38) annotation were further added as paths to the spliced pangenome graph. For comparison, we also created a spliced reference constructed from the reference sequence, once again using the GENCODE (v.38) transcript annotation. For both graphs, we created the indices needed for mapping using the vg toolkit (v.1.38.0) with default parameters, except when pruning, for which edges on embedded paths were restored (vg prune -r). Furthermore, for the spliced HPRC pangenome graph, it was necessary to use stricter pruning parameters (vg prune -r -k 64 -M 64). For the spliced reference, we also created the index needed by the RNA-seq mapper STAR using default parameters.

We simulated RNA-seq reads with a pipeline that was designed to preserve complex genome variation in the simulated data. The transcript sequences used for the simulation were derived from the GENCODE (v.38) transcript annotations projected onto assembled haplotypes from HG002. Specifically, we used MC to create an alignment between GRCh38 and the two HPRC HG002 assembly haplotypes, which were held out of the main pangenome graph for benchmarking. We then used CAT to lift the transcript annotations over to these haplotypes. We constructed a spliced personal genome graph using the vg rna subcommand, and then simulated reads using vg sim (commit 2cea1e2), with an Illumina NovaSeq cDNA read set (SRR18109271) used to fit the model parameters. This essentially amounts to simulating directly from the projected transcript sequences. The transcripts were simulated with uniform expression, split evenly between the two haplotypes, keeping the reads from each haplotype separate. This expression profile is not biologically realistic, but it avoids the difficulty of choosing a particular expression profile as representative of all tissues and stages of development. Moreover, existing estimated profiles would be biased towards the tools that were used to estimate them. We simulated 5,000,000 paired-end 150 bp RNA-seq reads.

Both simulated and real Illumina RNA-seq reads were mapped to the graphs using vg mpmap (commit c0c4816) with default parameters. In addition, the reads were mapped to the spliced reference using STAR (v.2.7.10a) with default parameters58. For the real data, we used previously published NA12878 RNA-seq data (SRR1153470)127 and data from the ENCODE project (ENCSR000AED, replicate 1)128,129.

We used the same approach as previously described57 to evaluate the alignments. In brief, for the simulated data, the graph alignments were compared with the truth alignments by estimating their overlap on the reference genome paths. The graph alignments were projected to the reference paths using vg surject -S. A uniquely aligned read (one with primary MAPQ ≥ 30) was considered correct if it overlapped 90% of the truth alignment. A multi-mapped read was considered correct if any of the multi-mappings was correct under the same criterion. For the real data, the average read coverage of each exon on the reference path calculated from the projected graph alignments was compared with the corresponding coverage estimated from long-read alignments. For the long-read data, we used PacBio Iso-Seq alignments from the ENCODE project (ENCSR706ANY, all replicates), which come from the same cell line as the Illumina data. The long-read alignments were used to define the exons, and only primary long-read alignments with a mapping quality of at least 30 were used. The alignments for the four Iso-Seq replicates (ENCFF247TLH, ENCFF431IOE, ENCFF520MMC and ENCFF626GWM) were combined and filtered using SAMtools (v.1.15)130. BEDTools (v.2.30.0) was used to convert the alignments to exon coordinates123.

We also used the results of the mapping experiment to quantify allelic bias. We used vg deconstruct to identify sites of variation in the MC graph of haplotypes from HG002, with deletions greater than 10 kb removed to avoid spurious variants. Variants overlapping exons were selected and normalized using BCFtools (v.1.16)130. Next we counted the number of mapped reads from each of the two haplotypes that overlapped each heterozygous exonic variant57. Specifically, the read count for each allele was calculated as the average count across the two breakpoints of an allele. This was done to treat different variant types and lengths equally. We then tested all variants with a read coverage of at least 20 for allelic bias using a two-sided binomial test at α = 0.01. All tests that reject the null hypothesis are false positives, as the reads were simulated without allelic bias. The results were split into different classes of variants and plotted against the number of sites that achieved a coverage of at least 20, which is a rough indicator of mapping sensitivity. Indels larger than 50 bp were excluded.
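
A minimal sketch of the per-variant test, assuming read counts supporting each haplotype are available (the values here are hypothetical):

from scipy.stats import binomtest

def allelic_bias_test(count_hap1, count_hap2, alpha=0.01):
    """Two-sided binomial test for allelic bias at a heterozygous exonic variant."""
    n = count_hap1 + count_hap2
    if n < 20:  # only variants with a read coverage of at least 20 were tested
        return None
    result = binomtest(count_hap1, n, p=0.5, alternative="two-sided")
    return result.pvalue < alpha  # True = null rejected; for simulated unbiased reads, a false positive

print(allelic_bias_test(18, 7))  # example: 25 reads, 18 from haplotype 1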

We also compared gene expression inference using the mapped reads. For the vg mpmap (v.1.43.0) graph mappings, we used RPVG (commit 1d91a9e)57 in transcript inference mode to quantify expression. Two different transcript annotations were used as input to RPVG. The MC pantranscriptome was created from the CAT transcript annotations on each assembly and the GENCODE (v.38) transcriptome. The mpmap-RPVG pipeline was compared with Salmon (v.1.9.0)131 and RSEM (v.1.3.3)132. These methods were provided the GENCODE (v.38) transcriptome as input, both with and without transcripts on the GRCh38 alternative contigs. Any genes unique to the alternative contigs were filtered. For Salmon, the GRCh38 reference was used as a decoy and duplicate transcripts were kept. Bowtie2 (v.2.4.5)133 was used as a mapper for RSEM. Gene expression values were calculated by summing the corresponding transcript level values for each gene. GffRead (v.0.12.7) was used to create a table of gene names and transcript identifiers from the transcript annotation134. The accuracy was measured in two ways: (1) the Spearman correlation between the simulated and inferred expression values, and (2) the mean absolute relative difference between the simulated and inferred expression values.
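
The two accuracy measures can be sketched as follows (hypothetical expression vectors; the exact definition of the relative difference may differ from the one used in the study):

import numpy as np
from scipy.stats import spearmanr

def expression_accuracy(simulated, inferred):
    simulated = np.asarray(simulated, dtype=float)
    inferred = np.asarray(inferred, dtype=float)
    rho, _ = spearmanr(simulated, inferred)
    # One common definition of the absolute relative difference per gene:
    # |sim - inf| / ((sim + inf) / 2), taken as 0 when both values are 0.
    denom = (simulated + inferred) / 2
    ard = np.where(denom > 0, np.abs(simulated - inferred) / np.where(denom > 0, denom, 1), 0.0)
    return rho, float(np.mean(ard))

print(expression_accuracy([10, 0, 5, 2], [12, 0, 4, 2]))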

The scripts that were used for graph construction, read simulation, mapping and evaluation are available at GitHub (github.com/jonassibbesen/hprc-rnaseq-analyses-scripts).

ChIP-seq analysis

We aligned H3K4me1, H3K27ac and ATAC-seq data obtained from monocyte-derived macrophages from 30 individuals59 using vg map13 to the GRCh38 reference genome graph and to the HPRC genome graph. We then called peaks using Graph Peak Caller (v.1.2.3)135 on both sets of alignments for each of the 30 H3K4me1, H3K27ac and ATAC-seq samples. To identify HPRC-only peaks, we projected HPRC coordinates to the GRCh38 path using Graph Peak Caller and compared intervals using BEDTools123. We labelled HPRC peaks that overlapped GRCh38 peaks as common peaks and those that did not as HPRC-only. We calculated the expected frequency (as inverse cumulative distributions) of common and HPRC-only peaks among the 30 samples by resampling the peaks of each sample from the peaks of all the samples and re-counting the number of overlaps. We repeated this simulation 100 times and plotted the average curves. We determined heterozygous variants in our samples by aligning WGS datasets for each sample to the HPRC graph using vg map and genotyping the variants using vg call -a. We narrowed the list to heterozygous SVs larger than 50 bp in each sample, with the aim of identifying allele-specific peaks. For each epigenomic sample, we obtained allele-specific read counts within peaks that lie on the previously identified loci by running vg call -a on the epigenomic HPRC alignments, which outputs the number of reads on each path in a bubble (DP and AD fields in the VCF output). We then assigned peaks to the SV allele, the reference allele or both alleles using a two-tailed binomial test parameterized on the sum of reads on both alleles and P = 0.05. Any peak with reads on one allele but not on the other was assigned to the allele with the reads. Read counts were proportionally adjusted for the difference in length between the reference and SV alleles.
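
As a sketch of this assignment step (hypothetical; the direction of the length adjustment and the handling of ties are assumptions, not taken from the study):

from scipy.stats import binomtest

def assign_peak(ref_reads, sv_reads, ref_len, sv_len, p_threshold=0.05):
    """Assign a peak to the reference allele, the SV allele or both alleles."""
    if ref_reads == 0 and sv_reads == 0:
        return None
    if ref_reads == 0:
        return "SV allele"          # reads on one allele only
    if sv_reads == 0:
        return "reference allele"   # reads on one allele only
    # Proportionally adjust the SV-allele count for the allele length difference.
    adj_sv = sv_reads * ref_len / sv_len
    n = round(ref_reads + adj_sv)
    test = binomtest(ref_reads, n, p=0.5, alternative="two-sided")
    if test.pvalue >= p_threshold:
        return "both alleles"
    return "reference allele" if ref_reads > adj_sv else "SV allele"

print(assign_peak(ref_reads=40, sv_reads=8, ref_len=200, sv_len=800))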

Processed data, scripts and code for the above steps are available at Zenodo (doi.org/10.5281/zenodo.6564396).

Population genetic analyses

Although the size of the population sample represented in our pangenome is small, it provides access to previously under-ascertained regions of the genome. We sought to understand the potential utility of these regions for future population genetic studies using regional PCA based on variants called against the CHM13 and GRCh38 references. For these analyses, we considered both the PGGB (whole pangenome, combined) and MC (reference-based, distinct CHM13 and GRCh38) graphs. For both graph models, the CHM13 VCFs provide access to regions that were not previously observed by studies based on GRCh38 and in which short-read-based studies may have difficulty reliably aligning and calling variants. In combination, these two graphs provide cross-validation of implied population genetic patterns in these new regions, which we explore here.

To understand chromosome-specific patterns of variation, we applied PCA independently to each autosomal chromosome using the VCFs from PGGB (PGGB-CHM13, PGGB-GRCh38).

To ensure that observed patterns were not derived from higher rates of assembly error in the repetitive regions of acrocentric p-arms, we used our Flagger-confident region annotations to prune the PGGB graph to only confident regions of assemblies (using odgi inject to inject the confident regions as subpaths and then odgi prune to remove the full original paths that included unreliable regions). We then reapplied vg deconstruct to this graph to obtain a new set of SNPs (the code for pruning the PGGB graphs and calling variants on the pruned graphs can be found at github.com/pangenome/HPRCyear1v2genbank/blob/main/workflows/confident_variants.md). Genome-wide, we found that pruning reduced the number of called SNPs by only 1.188% (previous N = 23,272,652, pruned N = 22,996,113). The total reduction in the acrocentrics was higher, with 6.29% fewer SNPs (previous N = 3,735,605, pruned N = 3,676,746), which indicated the difficulty of assembling these regions. We note that the PCA sample distributions remained almost identical (data not shown), which indicated that the patterns observed in the full graph are maintained despite assembly issues. In these filtered PGGB-CHM13 and PGGB-GRCh38 VCFs, we considered all biallelic SNPs relative to the chosen reference, regardless of variant nesting level (data not shown; filtering for only SNPs with LV = 0 or LV > 0 produced nearly identical results). A qualitative evaluation suggested no significant differences in PCA patterns across the metacentric chromosomes (Supplementary Fig. 47). However, in the p-arms of the acrocentrics (chromosomes 13, 14, 15, 21 and 22), which are accessible in the PGGB-CHM13 VCF, we observed a reduction in population differentiation and a higher rate of variance explained in the lowest principal component.

To investigate this quantitatively, we measured the number of clusters implied by the PCA for the PGGB-CHM13 VCFs, using k-means clustering to automatically determine the optimal number of clusters for each PCA (gap_stat method in the fviz_nbclust function of the factoextra R package) (analysis code available at github.com/SilviaBuonaiuto/hprcPopGenAnalysis). Applying this approach to three PCAs per chromosome VCF, we obtained optimal cluster counts for the p-arm, the q-arm and the entire chromosome. In metacentric chromosomes, we usually observed optimal numbers of clusters approximately corresponding to the number of expected world population groupings in the input genomes (3–5, as in Supplementary Fig. 48). However, in the p-arms of the acrocentrics, we observed far fewer clusters, in general only one, indicative of reduced population differentiation compared with other parts of the acrocentric chromosomes. This pattern was only apparent in the PGGB graph based on CHM13. To evaluate the difference quantitatively, we applied a Wilcoxon rank-sum test to compare the cluster count distributions of metacentric versus acrocentric chromosomes across the entire chromosome, the q-arm and the p-arm. There were no significant differences between the distributions of acrocentric and metacentric chromosomes at the whole-chromosome scale or in the q-arms, but there was a significant difference (Wilcoxon P = 0.013) for the acrocentric p-arms (Supplementary Fig. 49).
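
For illustration, the final comparison can be sketched as follows (the published analysis was run in R; the cluster counts below are placeholders, not the observed values):

from scipy.stats import ranksums

# Optimal cluster counts per p-arm, as estimated from the per-chromosome PCAs.
metacentric_p_arms = [4, 5, 3, 4, 5, 4, 3, 4]  # placeholder values
acrocentric_p_arms = [1, 1, 1, 2, 1]           # chromosomes 13, 14, 15, 21 and 22 (placeholders)

stat, p_value = ranksums(acrocentric_p_arms, metacentric_p_arms)
print(f"Wilcoxon rank-sum P = {p_value:.3f}")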

This analysis indicates that significant challenges remain for the use of these new regions in population genetic studies. Patterns observed in PCA projections of the pangenome across all chromosomes suggest a distinct process of variation sharing between populations within the short arms of the acrocentrics. In effect, we observed a more homogeneous population in these regions when using the CHM13 assembly as a reference. This reference contains real sequence in these regions, whereas GRCh38 contains gaps, which render analysis impossible. The apparent population homogenization could be driven by error. We mitigated this issue by using only SNPs found in Flagger-confident regions, but this does not guard against potential sources of alignment error that are likely to be amplified by the repetitive sequences in these loci. It is also possible that the chromosome-specific partitioning process applied by both graph models is failing to correctly partition contigs on these short arms. The known homology between the short arms bolsters the possibility of ongoing sequence information exchange between non-homologous chromosomes3, which would be consistent with the patterns we observed. In summary, this analysis shows that, when using CHM13 as a reference, the behaviour of sequences on the short arms of the acrocentrics in the PGGB graph is not similar to that of other sequences in the pangenome.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
