Non-linear machine learning models incorporating SNPs and PRS improve polygenic prediction in diverse human populations

Study population

The study sample included 34,072 unrelated (3rd degree or less) TOPMed participants from eight U.S.-based cohort studies: Jackson Heart Study (JHS; n = 2,504), Framingham Heart Study (FHS; n = 3,520), Hispanic Community Health Study/Study of Latinos (HCHS/SOL; n = 6,408), Atherosclerosis Risk in Communities study (ARIC; n = 6,197), Cardiovascular Health Study (CHS; n = 2,835), Multi-Ethnic Study of Atherosclerosis (MESA; n = 3,949), Cleveland Family Study (CFS; n = 1,182), and Coronary Artery Risk Development in Young Adults Study (CARDIA; n = 2,468). Study descriptions are provided in Supplementary Note 1. Phenotypes were harmonized by the TOPMed Data Coordinating Center (DCC)34, and included age, sex, race/ethnicity, study (used as covariates), phenotypes of interest, and medications, which were used to adjust measures of relevant phenotypes (Supplementary Table 12). Note that sex was self-reported and verified by chromosomal sex; therefore, biological sex and gender identity are the same in these analyses, and we refer to this variable as “sex”. The dataset included 7,601 non-Hispanic Black participants, 14,142 non-Hispanic White participants, and 7,320 participants of Hispanic/Latino descent. The dataset was divided such that 20% of the data was held out as a validation set. A secondary analysis used a larger training dataset that included related individuals, but in which all individuals in the training dataset were still unrelated to those in the test dataset. Sex-stratified analyses were not performed due to limited sample size.

Ethical regulations

Participants from each of the studies contributing to the TOPMed consortium provided informed consent, and all studies were approved by the IRBs of the participating institutions. In detail, the HCHS/SOL was approved by the institutional review boards (IRBs) at each field center, where all participants gave written informed consent, and by the Non-Biomedical IRB at the University of North Carolina at Chapel Hill, for the HCHS/SOL Data Coordinating Center. The IRBs approving the study are: the Non-Biomedical IRB at the University of North Carolina at Chapel Hill, Chapel Hill, NC; the Einstein IRB at the Albert Einstein College of Medicine of Yeshiva University, Bronx, NY; the IRB at the Office for the Protection of Research Subjects (OPRS), University of Illinois at Chicago, Chicago, IL; the Human Subject Research Office, University of Miami, Miami, FL; and the Institutional Review Board of San Diego State University, San Diego, CA. The Framingham Heart Study was approved by the Institutional Review Board of the Boston University Medical Center. All study participants provided written informed consent. The ARIC study has been approved by Institutional Review Boards (IRBs) at all participating institutions: University of North Carolina at Chapel Hill IRB, Johns Hopkins University IRB, University of Minnesota IRB, and University of Mississippi Medical Center IRB. Study participants provided written informed consent at all study visits. All CHS participants provided informed consent, and the study was approved by the Institutional Review Board of the University of Washington. All MESA participants provided written informed consent, and the study was approved by the Institutional Review Boards at The Lundquist Institute (formerly Los Angeles BioMedical Research Institute) at Harbor-UCLA Medical Center, University of Washington, Wake Forest School of Medicine, Northwestern University, University of Minnesota, Columbia University, and Johns Hopkins University.
All CARDIA participants provided informed consent, and the study was approved by the Institutional Review Boards of the University of Alabama at Birmingham and the University of Texas Health Science Center at Houston. The JHS study was approved by Jackson State University, Tougaloo College, and the University of Mississippi Medical Center IRBs, and all participants provided written informed consent. The Cleveland Family Study was approved by the Institutional Review Board (IRB) of Case Western Reserve University and Mass General Brigham (formerly Partners HealthCare). Written informed consent was obtained from all participants.

Genotype data

We used whole genome-sequencing data from TOPMed35 Freeze 8, without restriction on sequencing depth, which contains 705,486,649 variants. The dataset includes samples sequenced through the National Human Genome Research Institute’s Centers for Common Disease Genomics (CCDG) program, where the sequence data for all TOPMed and CCDG samples were harmonized together via joint allele calling. The methods for TOPMed WGS data acquisition and quality control (QC) are provided in www.nhlbiwgs.org/topmed-whole-genome-sequencing-methods-freeze-8. Principal Components (PCs) and kinship coefficients were computed for the genetic data by the TOPMed DCC using the PC-Relate algorithm36 implemented in the GENESIS R package37. In this work, we used 5 PCs computed via the GENESIS R package PC-AiR algorithm38 to adjust for global ancestry. Based on the kinship coefficients, we identified related individuals and generated a dataset in which all individuals were degree-3 unrelated, i.e., all kinship coefficients were lower than 0.0625. We extracted allele counts of variants that passed QC from GDS files using the SeqArray39 package version 1.28.1 and then further processed them using R and Python scripts. After QC and filtering variants with MAF < 0.01 (with MAF being computed based on the multi-ethnic TOPMed dataset), we had 12,482,699 variants in the TOPMed data. For all variants, we set the effect allele to be the minor allele.
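The MAF filter and the minor-allele orientation of the effect alleles can be sketched as follows. This is a toy illustration over a small dosage matrix; the function name and data are hypothetical, and only the 0.01 threshold and the minor-allele convention come from the text:

```python
import numpy as np

def filter_and_orient(genotypes, maf_threshold=0.01):
    """Drop variants with MAF < threshold and recode dosages so the
    effect (counted) allele is always the minor allele.

    genotypes: (n_samples, n_variants) array of alternate-allele counts (0/1/2).
    """
    freq = genotypes.mean(axis=0) / 2.0      # alternate-allele frequency
    maf = np.minimum(freq, 1.0 - freq)       # minor-allele frequency
    keep = maf >= maf_threshold
    g = genotypes[:, keep].astype(float)
    flip = freq[keep] > 0.5                  # alt allele is the major allele here
    g[:, flip] = 2.0 - g[:, flip]            # count the minor allele instead
    return g, keep
```

In the real pipeline the dosages come from the SeqArray GDS files and the frequencies are computed over the full multi-ethnic sample, but the recoding logic is the same.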

Heritability estimation

Let K denote an n × n kinship matrix, having twice the kinship coefficient between the ith and jth participants in its (i, j) entry. For an outcome y, we assume the linear model:

$$\begin{array}{c}{y}_{i}={X}_{i}\beta +\,{\epsilon }_{i},\\ {cov}\left({{{{{\boldsymbol{\epsilon }}}}}}\right)={\sigma }_{\epsilon }^{2}{{{{{{\boldsymbol{I}}}}}}}_{n\times n}+{\sigma }_{k}^{2}{{{{{\boldsymbol{K}}}}}}\end{array}$$

(1)

where \({{{{{\boldsymbol{\epsilon }}}}}}={\left({\epsilon }_{1},\ldots ,{\epsilon }_{n}\right)}^{T}\) is the normally-distributed vector of errors across the sample. We estimated the narrow-sense heritability \({\hat{h}}^{2}={\hat{\sigma }}_{k}^{2}/({\hat{\sigma }}_{k}^{2}+{\hat{\sigma }}_{\epsilon }^{2})\) using the Restricted Maximum-Likelihood approach as implemented in the GCTA40 software (version 1.93.2). Confidence intervals at the 95% level are provided using the standard error (SE) approach, based on the assumption of an asymptotic normal distribution (estimate \(\pm 1.96\times\) SE).
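Given the two fitted variance components and the SE reported by GCTA, the point estimate and its 95% confidence interval follow directly; a minimal sketch (the numbers in the test are hypothetical, not study estimates):

```python
def heritability_ci(sigma2_k, sigma2_e, se_h2):
    """Narrow-sense heritability h2 = s2_k / (s2_k + s2_e) with a
    normal-approximation 95% CI (estimate +/- 1.96 * SE)."""
    h2 = sigma2_k / (sigma2_k + sigma2_e)
    return h2, (h2 - 1.96 * se_h2, h2 + 1.96 * se_h2)
```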

Phenotypes

We trained genetic prediction models to predict sleep duration, diastolic blood pressure, systolic blood pressure, triglycerides, LDL cholesterol, HDL cholesterol, total cholesterol, body mass index, and height; and we used sex, study, race/ethnicity, and age as covariates. For reproducibility, Supplementary Table 12 provides the coded names of each of the phenotypes and covariates used in the analysis and describes in detail transformations and exclusions.

For each of the phenotypes of interest, we excluded outlying individuals, defined by phenotypic values above the 99th quantile or below the 1st quantile for the phenotype, computed over the multi-ethnic dataset. Values of systolic blood pressure in individuals using antihypertensive medications were set to missing; similarly, values of triglycerides and total cholesterol were set to missing in individuals using cholesterol medications. Triglycerides and total cholesterol values were log transformed to obtain an approximate normal distribution. Values of diastolic blood pressure in individuals using antihypertensive medications were increased by 10 mmHg. Then, each phenotype was regressed on age, sex, study, and race/ethnicity. The residuals were extracted and rank-normalized. Subsequent analyses used these rank-normalized residuals as the outcomes41, and we refer to them henceforth as adjusted phenotypes.
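The adjustment pipeline (covariate regression followed by rank-normalization of the residuals) can be sketched as below. This is a simplified stand-in using scikit-learn and SciPy, not the exact harmonization code, and the 0.5 rank offset is one common choice of rank-based inverse normal transform:

```python
import numpy as np
import pandas as pd
from scipy.stats import norm
from sklearn.linear_model import LinearRegression

def rank_normalize(x):
    """Rank-based inverse normal transform (offset 0.5; ties get average ranks)."""
    ranks = pd.Series(x).rank()
    return norm.ppf((ranks - 0.5) / len(x))

def adjust_phenotype(pheno, covariates):
    """Regress the phenotype on covariates and rank-normalize the residuals,
    yielding the 'adjusted phenotype' used as the modeling outcome."""
    model = LinearRegression().fit(covariates, pheno)
    residuals = pheno - model.predict(covariates)
    return rank_normalize(residuals)
```

In the actual analysis the covariates are age, sex, study, and race/ethnicity (encoded appropriately); here any numeric design matrix stands in for them.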

Summary statistics from published GWAS

We used summary statistics from published GWAS to select SNPs and their weights to construct PRS, as well as to select SNPs to include in the ML models. The GWAS used for each of the nine phenotypes are described in Table 3. When possible, we used multi-ethnic GWAS. We lifted over the coordinates to our genome build GRCh38/hg38 using the LiftOver tool from the UCSC genome browser42. For about half of the phenotypes, 90% or more of the variants in the GWAS were found in the TOPMed data. For the other half, 60–70% were found (Supplementary Table 13).

Table 3 Description of published GWAS used to identify summary statistics.

Polygenic risk score (PRSice, LDpred2, and lassosum2)

We calculated the standard PRS using the classic clump-and-threshold methodology (PRSice PRS)2. We used PRSice 2 software version 2.3.143 to calculate the genetic score. SNPs with ambiguous alleles were removed using the PRSice software. We calculated PRS using two clumping regions (250 and 500 kb) on each side of the index SNP and three clumping R2 values (0.1, 0.2, 0.3). We considered p value thresholds of 0.5 through 1e−10. For each adjusted phenotype and each PRS defined by clumping region, clumping R2 value, and p value threshold, we fit a linear model including covariates, the PRS, and genetic PCs to account for population structure44. We selected the PRS whose model minimized the mean squared error in the training dataset. We assessed the percentage of variance explained (PVE) by the PRS models using the methodology described below.
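The scoring step of the clump-and-threshold approach (performed here by the PRSice software) amounts to an effect-size-weighted allele count over SNPs passing a p value threshold. A minimal sketch, assuming clumping has already been applied and the dosage matrix is aligned to the GWAS effect alleles:

```python
import numpy as np

def score_prs(genotypes, betas, pvals, p_threshold):
    """Clump-and-threshold PRS scoring: for each person, sum the
    effect-allele dosages weighted by GWAS effect sizes, over SNPs
    whose GWAS p value passes the threshold.

    genotypes: (n_samples, n_snps) dosage matrix.
    betas, pvals: (n_snps,) GWAS effect sizes and p values.
    """
    keep = pvals <= p_threshold
    return genotypes[:, keep] @ betas[keep]
```

In the grid search described above, this score would be recomputed for each (clumping region, clumping R2, p value threshold) combination before fitting the covariate-adjusted linear model.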

We also calculated the PRS using LDpred245 (LDpred2 PRS). We used the entire multi-ethnic TOPMed data to perform clumping with PRSice and to compute SNP weights using LDpred2. We used R package bigsnpr version 1.9.10 to calculate the genetic score with LDpred2-inf, which utilizes the infinitesimal model; LDpred2-grid, which uses a grid of values for the hyperparameters; and LDpred2-auto, which automatically estimates the proportion of causal variants (p) and the SNP heritability (h2) from the data, without any tunable hyperparameters. Because LDpred2 can compute joint weights for up to ~1 million SNPs, we used two approaches to select SNPs: (a) we pruned SNPs using pruning parameters R2 = 0.1 and distance = 500 kb; and (b) we selected the top 1 million SNPs with respect to their association p value in the summary statistics (and that overlapped with the TOPMed dataset) to train the model. Following the recommendation of the LDpred2 manuscript, we used the bigsnpr R package to correct the estimated effect sizes for the winner’s curse prior to applying LDpred2.

Finally, we calculated PRS via penalized regression on summary statistics, lassosum246. We used the lassosum2 implementation in R packages bigsnpr v1.9.5 and bigstatsr v1.5.647, using the default hyperparameters as described in detail in Privé et al.48, including the top 1 million SNPs from each GWAS.

All methods (PRSice, LDpred2, and lassosum2) require a reference panel for LD inference, used for clumping (PRSice) or for tuning SNP effect sizes (LDpred2, lassosum2). We used the multi-ethnic TOPMed dataset as the LD reference panel.

After calculating the C+T PRSice PRS, LDpred2 PRS, and lassosum2 PRS, we selected the best-performing PRS in the training set for the purposes of comparison between linear PRS models and XGBoost.

LASSO and Gradient Boosted Trees (XGBoost) ensemble

Figure 2 describes the construction of an ensemble ML model for polygenic risk prediction. We considered for inclusion in the models all SNPs having p value < 1 × 10−4 in the corresponding GWAS and used them to develop an ensemble prediction model. In brief, the ensemble model included two steps: (1) a LASSO penalized regression for filtering candidate SNPs; and (2) an XGBoost prediction model allowing for nonlinear interactions. In detail, gradient boosted trees are a widely used machine learning technique that creates an ensemble of weak decision trees (i.e., limited in depth or interactions) by iteratively optimizing an objective function at each boosting step in which new trees are optimized based on the residuals of the previous boosting step. XGBoost is an optimized implementation of gradient boosted trees that is highly efficient in distributed computing environments13. However, the set of candidate SNPs is very large for most of the GWAS listed in Table 3, and boosting is prone to overfitting with high dimensionality49. LASSO50 is a commonly used model for feature selection that can mitigate overfitting by encouraging parsimony through L1 regularization. We trained an ensemble model, jointly training LASSO and XGBoost models in order to prevent overfitting due to the high dimensionality of genetic data (through LASSO) while simultaneously exploiting the nonlinear relationships and interaction effects (through XGBoost).

The ensemble model was trained as follows. For each given regularization hyperparameter \(\alpha \in [0,1]\), we fit LASSO on the training dataset using a 10-fold cross-validation scheme (with the MSE loss). The LASSO model included linear SNP effects and unpenalized covariates; to reduce the required computational resources, it was fit separately with the same α on 5 sets of SNPs, each including all SNPs from a few chromosomes, chosen so that the number of SNPs was roughly equivalent between models. Next, we filtered to SNPs with non-zero coefficients in the LASSO model. Using these selected SNPs, we fit the XGBoost model via 3-fold cross-validation on the training dataset, allowing up to 10,000 boosted trees with early stopping after 10 rounds of boosting without improvement in the 3-fold cross-validation loss (see Table 2 for details). Based on this 3-fold cross-validation, we selected the number of trees \({\theta }_{\alpha }\) that minimized the mean squared prediction error (MSE), resulting in a set of parameters \((\alpha ,{\theta }_{\alpha })\). We then selected the optimal \((\alpha ,{\theta }_{\alpha })\) pair that minimized the MSE of the 3-fold cross-validation step across all values of α. For XGBoost, we always used a learning rate of 0.01, maximum depth of 5, column sample by tree of 90%, minimum child weight of 10, and subsample of 50%. Finally, we performed LASSO regression using the same variants selected in this process, to explicitly compare the results of a nonlinear model allowing for interactions to a linear model without interactions.

We performed this process individually for each of our nine adjusted phenotypes, as a regression task, using a distributed cluster computing environment. All models included the ancestral PCs, and some models used the best-performing PRS as a variable. We assessed the PVE of the genetic machine learning models using the methodology described below and compared the results with the best-performing linear PRS model. Analysis was conducted using Python 3 and the scikit-learn51 and xgboost packages13.

Secondary analysis comparing the ensemble model with a standard linear PRS model using the same potential SNP set

Because we limited the SNPs used by the ensemble model to those with p value < 10−4 in their discovery GWAS, in a secondary analysis we compared the performance of the ensemble models to linear C+T PRS models that use the same p value < 10−4 threshold for SNP selection. Thus, the two models rely on the same set of candidate SNPs.

Secondary analyses studying SNP selection into the XGBoost model

We performed two experiments to test the benefit of using LASSO to select SNPs prior to the XGBoost model, each considering a different way to select SNPs into the XGBoost model. In the first experiment, performed for four phenotypes (total cholesterol, triglycerides, LDL cholesterol, and HDL cholesterol), we selected SNPs at random as a baseline, by (1) randomly selecting the same number of SNPs as selected by LASSO in the respective XGBoost Alone and XGBoost with PRS models, (2) running the XGBoost model with and without PRS, (3) repeating this 100 times, and (4) averaging the results. In the second experiment, for LDL cholesterol, we used the SNPs with non-zero weights in the lassosum2 PRS as the selected SNPs in the XGBoost model. We used a limited set of phenotypes for these experiments due to computational limitations.
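The random-selection baseline (repeat, refit, average) can be sketched generically; here `fit_score` stands in for fitting the XGBoost model and returning its performance, and all names are hypothetical:

```python
import numpy as np

def random_snp_baseline(X, y, n_select, fit_score, n_repeats=100, seed=0):
    """Average performance of a model refit on random SNP subsets matched
    in size to the LASSO-selected set.

    fit_score: callable (X_subset, y) -> performance metric (e.g., PVE).
    """
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_repeats):
        cols = rng.choice(X.shape[1], size=n_select, replace=False)
        scores.append(fit_score(X[:, cols], y))
    return float(np.mean(scores))
```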

Race/ethnicity analysis

We first trained the models using the combined, multi-ethnic dataset (multi-ethnic model). We then trained the models on the subset of the sample containing only White, Black, and Hispanic/Latino participants. This resulted in four models that were each trained on different race/ethnicity groups: Multi-Ethnic, White, Black, and Hispanic/Latino. For each of these four models, we assessed the PVE among the participants of each race/ethnicity in the held-out test set.

Model evaluation in the held-out test set

We quantify model performance as the percent variance explained (sometimes referred to as the adjusted R2). Let \({y}_{i}^{0},\,i=1,\ldots ,n\) denote the adjusted phenotype values. \({{{{{\rm{Var}}}}}}({y}^{0})\) estimates the total baseline model variance. For a given model m, let \({\hat{y}}_{i}^{m}\) denote the predicted (adjusted) phenotype value for the ith person. We estimate the percent variance explained by model m as:

$${{{{{\rm{PVE}}}}}}=\left(1-\,\frac{{{{{{\rm{var}}}}}}({y}^{0}-{\hat{y}}^{m})}{{{{{{\rm{var}}}}}}\left({y}^{0}\right)}\right)\times 100 \% .$$

(2)

We compute the relative PVEs between various models as the relative percentage increase, i.e., (PVE2 − PVE1)/PVE1.
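Equation (2) and the relative PVE can be computed directly; a minimal sketch:

```python
import numpy as np

def pve(y0, yhat):
    """Percent variance explained by model predictions (Eq. 2):
    (1 - var(y0 - yhat) / var(y0)) * 100."""
    return (1.0 - np.var(y0 - yhat) / np.var(y0)) * 100.0

def relative_pve(pve1, pve2):
    """Relative percentage increase of model 2 over model 1."""
    return (pve2 - pve1) / pve1
```

A perfect prediction yields PVE = 100%, while predicting the mean of the adjusted phenotype yields PVE = 0%.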

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.