Biobank-scale inference of ancestral recombination graphs enables genealogical analysis of complex traits

ARG-Needle and ASMC-clust algorithms

We introduce two algorithms to construct the ARG of a set of samples, called ARG-Needle and ASMC-clust. Both approaches leverage output from the ASMC algorithm11, which takes as input a pair of genotyping array or sequencing samples and outputs a posterior distribution of the TMRCA across the genome. ARG-Needle and ASMC-clust use this pairwise genealogical information to assemble the ARG for all individuals.

ASMC-clust runs ASMC on all pairs of samples and performs hierarchical clustering of TMRCA matrices to obtain an ARG. At every site, we apply the unweighted pair group method with arithmetic mean (UPGMA) clustering algorithm58 on the N × N posterior mean TMRCA matrix to yield a marginal tree. We combine these marginal trees into an ARG, using the midpoints between sites’ physical positions to decide when one tree ends and the next begins. Using an O(N2) implementation of UPGMA59,60, we achieve a runtime and memory complexity of O(N2M). We also implement an extension that achieves O(NM) memory but increased runtime (Supplementary Note 1).

ARG-Needle starts with an empty ARG and repeats three steps to add additional samples to the ARG: (1) detecting a set of closest genetic relatives via hashing, (2) running ASMC and (3) ‘threading’ the new sample into the ARG (Fig. 1). Given a new sample, step 1 performs a series of hash table queries to determine the candidate closest samples already in the ARG24. We divide up the sites present in the genetic data into nonoverlapping ‘words’ of S sites and store hash tables mapping from the possible values of the ith word to the samples that carry that word. We use this approach to rapidly detect samples already in the ARG that share words with the target sample and return the top K samples with the most consecutive matches. A tolerance parameter T controls the number of mismatches allowed in an otherwise consecutive stretch. We also allow the top K samples to vary across the genome due to recombination events, by partitioning the genome into regions of genetic distance L. Assuming this results in R regions, the hashing step outputs a matrix of R × K sample identities (IDs) containing the predicted top K related samples for each region. We note, however, that the hashing step can look arbitrarily far beyond the boundaries of each region to select the K samples.

The sample IDs output by step 1 inform step 2, in which ASMC is run over pairs of samples. In each of the R regions, ASMC computes the posterior mean and maximum a posteriori TMRCA between the sample being threaded and each of the K candidate most related samples. We add burn-in on either side of the region to provide additional context for the ASMC model, 2.0 centimorgans (cM) for all simulation experiments unless otherwise stated and 1.0 cM in real data inference for greater efficiency.

In step 3, ARG-Needle finds the minimum posterior mean TMRCA among the K candidates at each site of the genome. Note that both the use of a posterior mean estimator with a pairwise demographic prior and the selection of a minimum among K estimated values lead to bias in the final TMRCA estimates (Supplementary Fig. 3h), which we later address using a postprocessing normalization step (see below). The corresponding IDs determine which sample in the ARG to thread to at each site. Because the posterior mean assumes continuous values and changes at each site, we average the posterior mean over neighboring sites where the ID to thread to and the associated maximum a posteriori remain constant. This produces piecewise constant values which determine how high above the sample to thread, with changes corresponding to inferred recombination events. The sample is then efficiently threaded into the existing ARG, utilizing custom data structures and algorithms.

Throughout our analyses we adopted K = 64, T = 1, L = 0.5 cM for array data and L = 0.1 cM for sequencing data. We used S = 16 in simulations, and in real data analyses we increased S as threading proceeded to reduce computation without a major loss in accuracy. For additional details on all three steps in the ARG-Needle algorithm and our parameter choices, see Supplementary Note 1.

ARG normalization

ARG normalization applies a monotonically increasing mapping from existing node times to transformed node times (similar to quantile normalization), further utilizing the demographic prior provided in input. We compute quantile distributions of node times in the inferred ARG as well as in 1,000 independent trees simulated using the demographic model provided in input under the single-locus coalescent. We match the two quantile distributions and use this to rewrite all nodes in the inferred ARG to new corresponding times (Supplementary Note 1). ARG normalization preserves the time-based ordering of nodes and therefore does not alter the topology of an ARG. It is applied by default to our inferred ARGs and optionally to ARGs inferred by Relate (Extended Data Figs. 24 and Supplementary Figs. 1, 3 and 4).

Simulated genetic data

We used the msprime coalescent simulator61 to benchmark ARG inference algorithms. For each run, we first simulated sequence data with given physical length L for N haploid individuals, with L = 1 Mb for sequencing and L = 5 Mb for array data experiments. Our primary simulations used a mutation rate of μ = 1.65 × 10−8 per base pair per generation, a constant recombination rate of ρ = 1.2 × 10−8 per base pair per generation and a demographic model inferred using SMC++ on CEU (Utah residents with ancestry from Northern and Western Europe) 1,000 Genomes samples10. These simulations also output the simulated genealogies, which we refer to as ‘ground-truth ARGs’ or ‘true ARGs’. To obtain realistic SNP data, we subsampled the simulated sequence sites to match the genotype density and allele frequency spectrum of UK Biobank SNP array markers (chromosome 2, with density defined using 50 evenly spaced MAF bins). When running ASMC, we used decoding quantities precomputed for version 1.1, which were obtained using a European demographic model and UK Biobank SNP array allele frequencies, setting two haploid individuals for pairwise TMRCA inference as ‘distinguished’ and sampling 298 haploid individuals as ‘undistinguished’11. ASMC and the hashing step of ARG-Needle also require a genetic map, which we computed based on the recombination rate used in simulations.

In addition to our primary simulations, we included various additional simulation conditions where we modified one parameter while keeping all others fixed. First, we varied the recombination rate to ρ {6 × 10−9, 2.4 × 10−8} per base pair per generation. Second, we used a constant demographic model of 15,000 diploid individuals, for which we generated new decoding quantities to run ASMC. Third, we inferred ARGs using sequencing data, running ASMC in sequencing mode. Fourth, we introduced genotyping errors into the array data. After sampling the array SNPs, we flipped each haploid genotype per SNP and individual with probability p (Supplementary Fig. 4).

Comparisons of ARG inference methods

We compared ASMC-clust and ARG-Needle with the Relate17 and tsinfer15 algorithms. Relate runs a modified Li-and-Stephens algorithm62 for each haplotype, using all other haplotypes as reference panel. It then performs hierarchical clustering on the output to estimate the topology of marginal trees at each site. Finally, it estimates the marginal tree branch lengths using a Markov chain Monte Carlo algorithm with a coalescent prior. tsinfer uses a heuristic approach to find a set of haplotypes that will act as ancestors for other haplotypes and to rank them based on their estimated time of origin. It then runs a variation of the Li-and-Stephens algorithm to connect older ancestral haplotypes to their descendants, forming the topology of the ARG. To improve the performance of tsinfer in the analysis of UK Biobank array data, the authors developed an alternative approach where subsets of the analyzed individuals are added as potential ancestors15. This approach was motived by the sparsity of the variant sites, so we refer to it as ‘tsinfer-sparse’, obtaining its code from ref. 63.

We ran Relate with the mutation rate, recombination rate and demographic model used in simulations. We kept Relate’s default option which limits the memory used for storing pairwise matrices to 5 GB. Because the branch lengths output by tsinfer and tsinfer-sparse are not calibrated, we omitted these methods in comparisons for metrics involving branch lengths. For each choice of sample size, we generated genetic data using five random seeds (25 random seeds in Extended Data Fig. 3d,e) and applied ARG-Needle, ASMC-clust, Relate, tsinfer and (when dealing with array data) tsinfer-sparse to infer ARGs. Due to scalability differences, we ran ASMC-clust and Relate in up to N = 8,000 haploid samples (N = 4,000 for sequencing) and ARG-Needle, tsinfer and tsinfer-sparse in up to N = 32,000 haploid samples. All analyses used Intel Skylake 2.6 GHz nodes on the Oxford Biomedical Research Computing cluster.

The Robinson–Foulds metric27 counts the number of unique mutations that can be generated by one tree but not the other. Because polytomies can skew this metric, we randomly break polytomies as done in ref. 15. We report a genome-wide average, rescaled between 0 and 1.

We generalized the Robinson–Foulds metric to better capture the accuracy in predicting unobserved variants by incorporating ARG branch lengths. To this end, we consider the probability distribution of mutations that arise from uniform sampling on an ARG, and compare the resulting distributions for the true and inferred ARG using the total variation distance, a metric for comparing probability measures. Polytomies do not need to be broken using this metric, as they simply concentrate the probability mass on fewer predicted mutations. We refer to this metric as the ARG total variation distance, and note that it bears similarities to previous extensions of the Robinson–Foulds metric64,65 (see Supplementary Note 2 for further details, including an extension that stratifies by allele frequencies).

We also used the KC topology-only distance averaged over all positions to compare ARG topologies. We observed that for methods that output binary trees (Relate, ASMC-clust and ARG-Needle), performance substantially improved when we selected inferred branches at random and collapsed them to create polytomies (solid lines in Extended Data Figs. 1c and 3g), suggesting that the KC topology-only distance rewards inferred ARGs with polytomies. We further quantified the amount of polytomies output by tsinfer and tsinfer-sparse as the mean fraction of nonleaf branches collapsed per marginal tree, observing that when polytomies were randomly broken15, performance on the KC topology-only distance deteriorated (dashed lines in Fig. 2d and Extended Data Figs. 1b,c and 3f,g). To account for these observations, we compared all methods both with the restriction of no polytomies and with allowing all methods to output polytomies (Fig. 2d and Extended Data Figs. 1b,d and 3f,h). In the latter case, we formed polytomies in ARGs inferred by Relate, ASMC-clust and ARG-Needle using a heuristic to select and collapse branches that are less confidently inferred. For each marginal binary tree, we ordered the N − 2 nonleaf branches by computing the branch length divided by the height of the parent node, and collapsed a fraction f of branches for which this ratio is smallest (for additional details, see Supplementary Note 2).

We used the pairwise TMRCA RMSE metric to measure accuracy of inferred branch lengths. The KC distance may also consider branch lengths28, and we performed evaluations using the branch-length-aware versions of the KC distance with parameter λ = 1, which compares lengths between pairwise MRCA events and the root, and λ = 0.02, which combines branch length and topology estimation (Supplementary Fig. 1).

Supplementary Note 2 provides further details on the computation of these metrics and their interpretation in the context of ARG inference and downstream analyses.

ARG-MLMA

We developed an approach to perform MLMA of variants extracted from the ARG, which we refer to as ARG-MLMA. We sampled mutations from a given ARG using a specified rate μ and applied a mixed model association test to these variants. Note that each mutation occurs on a single branch of marginal trees, so that recurrent mutations are not modeled.

For simulation experiments (Fig. 3a and Extended Data Fig. 6) we tested all possible mutations on a true or inferred ARG, which is equivalent to adopting a large value of μ. We used sequencing variants from chromosomes 2–22 to form a polygenic background and added a single causal sequencing variant on chromosome 1 with effect size β. We varied the value of β and measured power as the fraction of runs (out of 100), detecting a significant association on the ARG for chromosome 1. For ARG-MLMA UK Biobank analyses we adopted μ = 10−5, also adding variants sampled with μ = 10−3 to locus-specific Manhattan plots to gain further insights. For additional details on our ARG-MLMA methods, including the determination of significance thresholds, see Supplementary Note 4.

Construction of ARG-GRMs

Consider N haploid individuals, M sites and genotypes xik for individual i at site k, where variant k has mean pk. Under an infinitesimal genetic architecture, the parameter α captures the strength of negative selection30,66, with a trait’s genetic component given by gi = ∑k βkxik where Var(βk) = (pk(1 − pk))α. Using available markers, a common estimator for the ij-th entry of the N × N GRM21 is

$$K_\alpha \left( {i,j} \right) = \frac{1}{M}\mathop {\sum }\limits_{k = 1}^M \frac{{\left( {x_{ik} – p_k} \right)\left( {x_{jk} – p_k} \right)}}{{\left[ {p_k\left( {1 – p_k} \right)} \right]^{ – \alpha }}}.$$

(1)

Given an ARG, we compute the ARG-GRM as the expectation of the marker-based GRM that would be obtained using sequencing data, assuming that mutations are sampled uniformly over the area of the ARG. When sequencing data are unavailable but an ARG can be estimated from an incomplete set of markers, the ARG-GRM may provide a good estimate for the sequence-based GRM. We briefly describe how ARG-GRMs are derived from the ARG for the special case of α = 0. We discuss the more general case and provide further derivations in Supplementary Note 3.

Assuming α = 0, equation (1) is equivalent (up to invariances described in Supplementary Note 3) to the matrix whose ij-th entry contains the number of genomic sites at which sequences i and j differ (that is, their Hamming distance). This may be expressed as

$$K_{\mathrm{H}}\left( {i,j} \right) = \frac{1}{M}\mathop {\sum }\limits_{k = 1}^M x_{ik} \oplus x_{jk},$$

where refers to the exclusive or (XOR) function. Assume there are L base pairs in the genome and a constant mutation rate per base pair of μ, and let tijk denote the TMRCA of i and j at base pair k. The ij-th entry of the ARG-GRM is equivalent to the expected number of mutations carried by only one of the two individuals, which is proportional to the sum of the pairwise TMRCAs across the genome (Extended Data Fig. 7a):

$$\begin{array}{l}K_{\mathrm{ARG}}\left( {i,j} \right) = {\Bbb E}\left[ {K_{\mathrm{H}}\left( {i,j} \right)|\mathrm{ARG}} \right]\\ = \mathop {\sum }\limits_{k = 1}^L P\left( {\mathrm{Poisson}\left( {2\mu t_{ijk}} \right) > 0} \right) = \mathop {\sum }\limits_{k = 1}^L 1 – {{{\mathrm{exp}}}}\left( { – 2\mu t_{ijk}} \right) \approx \mathop {\sum }\limits_{k = 1}^L 2\mu t_{ijk}.\end{array}$$

For increased efficiency, we compute a Monte Carlo ARG-GRM by uniformly sampling new mutations on the ARG with a high mutation rate and apply equation (1) to build the ARG-GRM using these mutations. We used simulations to verify that Monte Carlo ARG-GRMs converge to exactly computed ARG-GRMs for large mutation rates, saturating at μ ≈ 1.65 × 10−7 (Extended Data Fig. 7b,c), the default value we used for ARG-GRM computations. Stratified Monte Carlo ARG-GRMs may also be computed by partitioning the sampled mutations based on allele frequency, LD or allele age36,31,67,68 (Supplementary Note 3).

ARG-GRM simulation experiments

We simulated polygenic traits from haploid sequencing samples for various values of h2 and α. We varied the number of haploid samples N but fixed the ratio L/N throughout experiments, where L is the genetic length of the simulated region. For heritability and polygenic prediction experiments, we adopted L/N = 5 × 10−3 Mb per individual. For association experiments, we simulated a polygenic phenotype from 22 chromosomes, with each chromosome consisting of equal length L/22 and L/N = 5.5 × 10−3 Mb per individual. Mixed-model prediction r2 and association power may be roughly estimated as a function of h2 and the ratio N/M, where M is the number of markers39,69,70. We thus selected values of M and L such that the N/M ratio is kept close to that of the UK Biobank (L = 3 × 103 Mb, N ≈ 6 × 105).

We computed GRMs using ARGs, SNP data, imputed data (IMPUTE4 (ref. 38) within-cohort imputation) and sequencing data, and performed complex trait analyses using GCTA21. Polygenic prediction used cvBLUP71 leave-one-out prediction within GCTA. ARG-GRM association experiments (Fig. 3c and Extended Data Fig. 8c,f) tested array SNPs on each chromosome while using GRMs built on the other 21 chromosomes to increase power, measured as the relative increase of mean −log10(P) compared with linear regression. We observed that MAF-stratification for ARG-GRMs of true ARGs enabled robust heritability estimation and polygenic prediction if α is unknown (Extended Data Fig. 8g). In experiments involving inferred ARGs (Fig. 3b and Supplementary Fig. 8), we applied MAF-stratification for ARG-Needle ARGs and imputed data, but not for SNP data, for which GCTA did not converge.

ARG-Needle inference in the UK Biobank

Starting from 488,337 samples and 784,256 available autosomal array variants (including SNPs and short indels), we removed 135 samples (129 withdrawn, 6 due to missingness > 10%) and 57,126 variants (missingness > 10%). We performed reference-free phasing of the remaining variants and samples using Beagle 5.1 (ref. 72) and extracted the unrelated white British ancestry subset defined in ref. 38, yielding 337,464 samples. We built the ARG of these samples using ARG-Needle, using parameters described above. We parallelized the ARG inference by splitting phased genotypes into 749 nonoverlapping ‘chunks’ of approximately equal numbers of variants. We added 50 variants on either side of each chunk to provide additional context for inference and independently applied ARG normalization on each chunk. For our brief comparison of ARG inference methods in real data (Supplementary Fig. 6c,d), we repeatedly sampled independent subsets of N = 2,000 and N = 16,000 diploid individuals, and inferred the ARG for these individuals using the first chunk in the second half of chromosome 1, which amounted to 7.5 Mb.

Genealogy-wide association scan in the UK Biobank

To process phenotypes (standing height, alkaline phosphatase, aspartate aminotransferase, low-density lipoprotein (LDL)/high-density lipoprotein (HDL) cholesterol, mean platelet volume and total bilirubin) we first stratified by sex and performed quantile normalization. We then regressed out age, age squared, genotyping array, assessment center and the first 20 genetic principal components computed in ref. 38. We built a noninfinitesimal BOLT-LMM mixed model using SNP array variants, then tested HRC + UK10K-imputed data38,40,41 and variants inferred using the ARG (ARG-MLMA, described above). For association of imputed data (including SNP array) we restricted to variants with Hardy–Weinberg equilibrium P > 10−12, missingness < 0.05 and info score > 0.3 (matching the filtering criteria adopted in ref. 38). For all analyses we did not test variants with an MAC < 5 and used MAF thresholds detailed below.

Association analysis of seven traits

Using the filtering criteria above and no additional MAF cutoff, we obtained resampling-based genome-wide significance thresholds of P < 4.8 × 10−11 (95% CI: 4.06 × 10−11, 5.99 × 10−11) for ARG and P < 1.06 × 10−9 (95% CI: 5.13 × 10−10, 2.08 × 10−9) for imputation (Supplementary Table 1 and Supplementary Note 4). After performing genome-wide MLMA for the seven traits, we selected genomic regions harboring low-frequency (0.1% ≤ MAF < 1%), rare (0.01% ≤ MAF < 0.1%) or ultra-rare (MAF < 0.01%) variant associations. We then formed regions by grouping any associated variants within 1 Mb of each other and adding 0.5 Mb on either side of these groups.

We next performed several further filtering and association analyses using a procedure similar to ref. 43 to extract sets of approximately independent signals (‘independent’ for short; Supplementary Tables 25 and Supplementary Note 4). Of the seven phenotypes, total bilirubin did not yield any rare or ultra-rare independent signals and height did not yield any independent ultra-rare signals. We leveraged the UK Biobank WES data to validate and localize independent associations. We extracted 138,039 exome-sequenced samples that overlap with the analyzed set of white British ancestry individuals and performed lift-over of exome sequencing positions from genome build hg38 to hg19. We computed pairwise LD between the set of independent associated variants and the set of all WES variants, defining the ‘WES partner’ of an independent variant to be the WES variant with largest r2 to it. In a few instances, the same WES partner was selected by two ARG variants or two imputation variants (Supplementary Tables 25). Additionally, three WES partners were selected by one ultra-rare ARG and one rare imputation variant, and one WES partner was selected by one rare ARG and two ultra-rare imputation variants; these WES partners counted towards the 36 WES partners identified by both methods in rare and ultra-rare analyses, but were not counted as jointly identified when restricting to only rare or ultra-rare categories (as in Fig. 4a). We labeled WES variants by gene and functional status (‘loss-of-function’ and ‘other protein altering’; Supplementary Note 4) based on annotations obtained using the Ensembl Variant Effect Predictor (VEP) tool73.

Association analysis for higher frequency variants and height

For genome-wide association analyses of higher frequency variants and height, we matched filtering criteria used in ref. 38, retaining imputed variants that satisfy the basic filters listed above, as well as MAF ≥ 0.1%. Using these criteria, we estimated a resampling-based genome-wide significance threshold of 4.5 × 10−9 (95% CI: (2.2 × 10−9, 9.6 × 10−9); Supplementary Table 1). To facilitate direct comparison, we aimed to select parameters that would result in a comparable significance threshold for the ARG-MLMA analysis. Two sets of parameters satisfied this requirement: 3.4 × 10−9 (95% CI: 2.4 × 10−9, 5 × 10−9), obtained for μ = 10−5, MAF ≥ 1%; and 4 × 10−9 (95% CI: 3.1 × 10−9, 5.3 × 10−9), obtained for μ = 10−6, MAF ≥ 0.1%. We selected the former set of parameters, as a low sampling rate of μ = 10−6 leads to worse signal-to-noise and lower association power. We thus used a significance threshold of P < 3 × 10−9 for all analyses of higher frequency variants. We used PLINK74,75 (v.1.90b6.21 and v.2.00a3LM) and GCTA21 (v.1.93.2) to detect approximately independent associations using COJO47, retaining results with COJO P < 3 × 10−9 (Fig. 5e,f, Extended Data Fig. 10e, Supplementary Fig. 11 and Supplementary Note 4).

Statistics and reproducibility

For real data analysis in the UK Biobank, we included all 337,464 individuals of white British ancestry (as reported in ref. 38) who did not have genotype missingness > 10% and had not withdrawn from the UK Biobank at the time of our analysis. To further explore our findings, we selected the 138,039 of these individuals who were exome sequenced in the 200,000 UK Biobank whole-exome sequencing release.

Ethics

UK Biobank data were analyzed after approval of UK Biobank proposal no. 43206.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Read more here: Source link