Genetic studies of paired metabolomes reveal enzymatic and transport processes at the interface of plasma and urine

Study design and participants

The GCKD study is an ongoing prospective observational study that enrolled 5,217 adult persons with CKD between 2010 and 2012. Patients regularly seen by nephrologists with eGFR between 30 and 60 ml min−1 per 1.73 m2 or eGFR >60 ml min−1 per 1.73 m2 with UACR > 300 mg per g (or urinary protein/creatinine ratio > 500 mg per g) were included53. This study used biomaterials collected at the baseline visit, shipped frozen to a central biobank and stored at −80 °C54. A more detailed description of the study design, standard operating procedures and the recruited study population has been published53,55. The GCKD study was registered in the national registry for clinical studies (DRKS 00003971) and approved by local ethic committees of the participating institutions (universities or medical faculties of Aachen, Berlin, Erlangen, Freiburg, Hannover, Heidelberg, Jena, München and Würzburg)53. All participants provided written informed consent. For this project, metabolites were quantified from stored EDTA plasma and spot urine. Information on genome-wide genotypes, covariates and metabolites was available for 4,960 (plasma) and 4,912 (urine) persons.

Genotyping and imputation

Genotyping and data cleaning in the GCKD study were conducted as follows5,56. Genomic DNA from GCKD participants was genotyped at 2,612,357 variants using Illumina Omni2.5Exome BeadChip arrays and imputed using minimac3 version 2.0.1 at the Michigan Imputation Server57 and the Haplotype Reference Consortium haplotype version r1.1 and Eagle 2.3 for phasing. On the variant level, SNPs with <96% call rate, imputation quality of r2 ≤ 0.3, MAF < 1% or deviating from Hardy–Weinberg equilibrium (P < 1 × 10−10) and all multi-allelic SNPs were removed. The cleaned genotype dataset contained 5,034 individuals and 7,724,508 high-quality autosomal variants for GWAS. Genotyping of ARIC samples was performed on the Affymetrix 6.0 DNA microarray and filtered for call rates <90% and Hardy–Weinberg equilibrium P values < 10−6. SNPs were then imputed to the TOPMed Freeze 5b reference panel and filtered for r2 ≤ 0.1 (imputation quality).

Metabolite identification and quantification

Non-targeted mass spectrometry analysis was performed at Metabolon, and sample preparation was carried out as published by Schlosser et al.5. Automated comparison of the ion features in the experimental samples to a reference library of chemical standard entries (>4,500 purified standards) was used for metabolite identification. Known metabolites reported in this study conformed to confidence level 1 (the highest confidence level of identification) of the Metabolomics Standards Initiative58,59, unless otherwise denoted with an asterisk. Additional mass spectral entries have been created for compounds of unknown structural identity (unnamed biochemicals; >2,750 in the Metabolon library), which have been identified by virtue of their recurrent nature (both chromatographic and mass spectral). Peaks were quantified using the area under the curve and normalized to correct for variation resulting from instrument interday tuning differences by the median value for each run day. Likewise, metabolites in the ARIC replication sample were also quantified with the Metabolon HD4 platform.

Data cleaning of quantified metabolites

An in-house pipeline was set up for data quality control, filtering and normalization of metabolite concentrations. No plasma specimens and four pairs of urine specimens with a Pearson correlation coefficient greater than 0.9 and differing sample IDs were removed. Four plasma specimens and no urine specimens were removed for >50% missing data. A total of 130 plasma and 131 urine metabolites were removed, as less than 300 genotyped samples were available.

To account for urine dilution, concentrations of each metabolite were pq normalized based on endogenous metabolites with <1% missing values (nmetabolites = 309)60. Of the log2-transformed metabolites, 15 plasma metabolites were excluded for low variance (<0.01), and none were excluded for too many outliers (>5% of samples outlying >5 s.d.). Three plasma samples and one urine sample represented an outlier >5 s.d. along one of the first 15 principal components based on metabolites with complete information. The final dataset consisted of 1,296 plasma and 1,401 urine log2-transformed traits for subsequent GWAS. Supplementary Table 2 provides detailed annotation of the metabolites, including heritability estimates for metabolites with at least one genetic association. Over the course of this project, two formerly different urine metabolites were merged because they represented the same molecule: X-12739 and X-24527 to the glutamine conjugate of C6H10O2 (1)* and X-23667 and X-24759 to (2-butoxyethoxy)acetic acid.

Definition of additional variables

In the GCKD study, an IDMS-traceable enzymatic assay (Creatinine Plus, Roche) was used to measure serum creatinine levels, for estimating GFR by means of the CKD-EPI formula61, and to measure urine creatinine levels. The Tina-quant Albumin assay (Roche) was used to measure serum and urine albumin, for adjustment and calculation of the UACR, respectively. The GFR was estimated in the ARIC study from serum creatinine and cystatin C using the CKD-EPI formula62.

Genome-wide association study of metabolite levels

Based on log2-transformed metabolite levels, residuals adjusted for age, sex and the first three genetic principal components were generated (similar to previous mGWAS5,6,56,63,64), with plasma levels additionally adjusted for ln(eGFR) and serum albumin. GWAS analyses of these residuals were performed with SNPTEST version 2.5.2 (www.well.ox.ac.uk/~gav/snptest/), using imputed genotype dosages and linear regression under additive modeling. Statistical significance was defined as genome-wide significance after correcting for multiple testing by a Bonferroni procedure (3.9 × 10−11 = 5 × 10−8 ÷ 1,296 plasma traits; 3.6 × 10−11 = 5 × 10−8 ÷ 1,401 urine traits).

Significantly associated SNPs were assigned to mQTL by selecting, for each trait, the SNP with the lowest P value as the index SNP, defining the corresponding locus as a 1-Mb interval centered on the index SNP and repeating the procedure for unassigned SNPs until no further genome-wide significant SNP remained. For each trait, overlapping intervals were combined into mQTL. The extended MHC region (chromosome 6, 25.5–34 Mb) was treated as one region. For associations with MAF < 3%, mQTLs were only kept if the index SNP remained significant with inverse-normal-transformed metabolite data. A regional association plot centered on the index SNP for each mQTL was generated using LocusZoom (version 1.3) and LD information from GCKD study genotypes52. Circular plots were created using Circos version 0.69-6 (ref. 65). The variance in metabolite levels explained by the index SNP of an mQTL was computed independently of other covariates.

We compared our findings to those from seven large studies of the plasma–serum metabolome that were published in peer-reviewed journals and shared their summary statistics6,7,8,9,10,11,12. These studies were selected to maximize overlap with our findings as studies of EA participants with large sample size that examined the effects of common SNPs on plasma–serum metabolite levels quantified with the Metabolon assay, rather than on rare variant association studies, or GWAS of metabolites quantified by different methods and/or in other populations, for example, refs. 66,67,68,69,70. Metabolites were matched by compound or chemical ID, if available, and biochemical name (ones not identical were checked manually). First, for each mQTL identified in one of the published plasma–serum studies mentioned above, available index SNPs were extracted from GWAS of the corresponding metabolite in plasma and urine, and effect direction and statistical significance were assessed at different levels of statistical significance (P value < 0.05, <0.05 ÷ no. mQTLs in the previous study, <5 × 10−8 and <5 × 10−8 ÷ no. mQTLs in the previous study). Validation required effect-direction consistency for comparisons involving results from the GCKD plasma mGWAS. Second, for each mQTL identified in this study in plasma or urine, availability of the corresponding index SNP and metabolite in the summary statistics of the previously published plasma–serum studies was assessed. If the index SNP was missing, we searched for proxy SNPs in high LD (r2 > 0.8) within a window of ±500 kb around the index SNP based on genetic data from the 1000 Genome Project phase 3 version 5 of European ancestry using snipa.helmholtz-muenchen.de/snipa/?task=proxy_search. For each study, the best available proxy SNP in terms of maximal LD and minimal distance was selected. Summary statistics were downloaded from metabolomics.helmholtz-muenchen.de/gwas/index.php?task=download (Shin et al.6), www.hli-opendata.com/Metabolome (Long et al.7, only summary statistics with P value < 10−5), omicscience.org/apps/crossplatform/ (Lotta et al.8), pheweb.org/metsim-metab/ (Yin et al.10), omicscience.org/apps/mgwas/mgwas.table.php (Surendran et al.11) and ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/; accession numbers for European GWAS are GCST90199621GCST90201020 (Chen et al.12). Hysi et al.9 shared their summary statistics upon request.

To determine the number of urine mQTLs not reported in our earlier study5, we examined for each mQTL from this study whether an associated SNP within a window of ±500 kb for the corresponding metabolite was identified in the earlier study.

Replication analyses in the ARIC study were performed using log2-transformed metabolite levels and the same covariables. Statistical significance was defined by a Bonferroni procedure (P value < 0.05 ÷ 459 and 0.05 ÷ 430 association tests with matching data for EA and AA, respectively) and consistent effect directions as in the GCKD study.

We included an interaction term between the mQTL and sex in a linear regression model with the same adjustments as before to test for potential differences of the 1,299 mQTLs in men and women. For significant interactions (P value < 0.05 ÷ 1,299), we performed sex-stratified analyses (Supplementary Table 7).

Heritability estimation

A genetic relationship matrix was calculated from all autosomal SNPs with an imputation quality of r2 > 0.6 using GCTA-GRM71. GCTA-GREML72 was then used to estimate the proportion of variation in log2-transformed and, in the case of urine, pq-normalized metabolite levels that can be explained by the SNPs for all metabolites that gave rise to an mQTL.

Independent SNP selection and statistical fine mapping

We identified independent signals within mQTL using approximate conditional analyses, with LD information estimated from our study sample. The fine-mapping regions of mQTL were aligned within matrices across metabolites, if index SNPs were in LD (r2 > 0.8). For each mQTL, the GCTA-COJO Slct algorithm version 1.91.6 (ref. 73) was used to identify independent genome-wide significant SNPs (Pconditional < 3.9 × 10−11), using a collinearity cutoff of 0.1. For mQTL with multiple independent SNPs, approximate conditional analyses were carried out conditioning on the other independent SNPs in the region using the GCTA-COJO Cond algorithm to estimate conditional effect sizes. Statistical fine mapping was performed for all independent SNPs per mQTL. In loci with a single independent SNP, approximate Bayes factors (ABFs) were calculated from the original GWAS effect estimates using Wakefield’s formula74 with a standard deviation prior of 1.33. For mQTL with multiple independent SNPs, ABFs were derived from the conditional effect estimates. The SNP’s ABF was used to calculate the posterior probability for the variant driving the association signal (PPA, ‘causal variant’). Credible sets were calculated by summing the PPA across PPA-ranked variants until the cumulative PPA was >99%. log2-transformed credible set sizes were regressed on the MAFs of independent index SNPs.

Pairwise colocalization tests of plasma and urine mQTL

To examine whether association patterns with metabolites measured in plasma and/or urine are shared across or within matrices, we conducted pairwise colocalization analyses between mQTL. When the windows of ±500 kb around the index SNPs for two mQTLs overlapped, colocalization was performed within the region of the merged windows using a version of Giambartolomei’s colocalization method75 as implemented with the ‘coloc.fast’ function from the R package ‘gtx’ (github.com/tobyjohnson/gtx) with default parameters and prior definitions. To visualize the effect sizes and explained variance for colocalizing signals for mQTLs detected for the same metabolite across matrices (Extended Data Fig. 4), we used the R package ‘circlize’ (ref. 76).

Annotation

SNP annotation was performed by querying the SNiPA database version 3.4 (released 13 November 2020)13, based on the 1000 Genomes phase 3 version 5 and Ensembl version 87 datasets. The retrieved combined annotation-dependent depletion (CADD) score was based on CADD version 1.3. The Ensembl VEP tool was used for the effect prediction of SNPs. SNiPA was used to collect the following annotations for each index SNP: gene hit or close by, regulated genes, CADD score, SnpEff effect impact (exonic and noncoding), mQTL, pQTL, GWAS Catalog, cis eQTL, disease genes (based on ClinVar, OMIM, HGMD and Drugbank) and UK Biobank associations.

To select the most likely causal gene for each mQTL, the following steps were carried out: first, we compiled the ‘genes’ and ‘evidence’ information based on SNiPA13. Index SNPs were queried for association with differential expression of a nearby gene in tubulointerstitial kidney portions (cis eQTL) from 187 patients with CKD using the NephQTL browser77 and GTEx version 8 eQTL data78. Similarly, SNPs were queried for associations with differential levels of nearby proteins in plasma (2,751 unique proteins represented by 3,022 SOMAmers) in data from Sun et al.79 downloaded from www.phpc.cam.ac.uk/ceu/proteins/. Second, when one or more cis eQTL or cis pQTL associations with P < 0.05 ÷ 409 (plasma, 409 unique index SNPs) or P < 0.05 ÷ 410 (urine, 410 unique index SNPs), respectively, was identified within ±100 kb of an index SNP, colocalization analyses of the respective metabolite(s)’ mQTL and each of the eQTL and/or pQTL association(s) were performed within the eQTL–pQTL cis window in the underlying study (gene region ±500 kb) using the method outlined above. Positive colocalizations with gene expression received equal weight for all investigated tissues to maximize the opportunity to detect processes in tissues interacting with blood and being filtered to urine. Sensitivity analyses assigning 1.5-fold and twofold greater weight to colocalizations arising from kidney or liver tissue or from kidney tissue only yielded almost identical results. The evidence codes h, r, e, p, m and c based on SNiPA13 correspond to gene hit or close by, regulated genes, cis eQTL, cis pQTL, missense variants and disease genes based on pathogenic variants known to cause monogenic diseases, respectively. The evidence code E designated genes with evidence for colocalization with gene expression genome-wide association, and P designated those with protein-level genome-wide association. Evidence codes were collected and summed for each gene, where Ee and Pp only counted as one. The gene with the highest sum of scores within each locus was assigned as the most likely causal gene. In the case of ties, genes with evidence for gene expression colocalization were prioritized, followed by protein-level colocalization, followed by genes for which an inborn error of metabolism with the corresponding metabolite is known. When ties still remained, Ee scores were prioritized over E scores and Pp scores were prioritized over P scores. In all other cases, ties were resolved by prioritizing the closest gene; prioritization by distance determined the assigned most likely causal gene at 17% (221 of 1,299) of mQTLs. Lastly, the prioritized gene list was manually reviewed for biological plausibility based on published evidence and at colocalizing mQTLs as outlined in the Supplementary Methods. In case of a clear biological fit to another scored gene (that is, corresponding monogenic disease or animal model), the prioritized gene was reassigned. This final gene list (n = 282) was used as input for downstream gene-based analyses. Known drugs were annotated for each gene and the corresponding indication and status of approval based on platform.opentargets.org/.

Relation of mQTLs to plasma proteins in trans and phenotypes

We also performed colocalization analyses of mQTLs with disease outcomes and biomarker measurements in the UK Biobank, with two representative kidney function traits and with trans pQTLs using the precomputed pQTL data from Sun et al.79 to gain insights into clinical consequences and potential molecular mediators of mQTLs. Association summary statistics between SNPs and 30 biomarkers from the UK Biobank baseline examination, including the liver function markers AST, ALT, GGT, bilirubin and albumin, were computed using BOLT-LMM80 (application no. 20272) in the same subset of European-ancestry participants as previous studies81. Precomputed GWAS summary statistics of diseases as ascertained in the UK Biobank and analyzed using phecodes were obtained from www.leelabsg.org/resources (1,403 binary traits) and from yanglab.westlake.edu.cn/data/ukb_fastgwa/imp_binary/ (2,325 of 2,989 binary traits82; traits containing job-coding terms were excluded from the analysis). There were 816 phecodes analyzed in both, but only unique phecodes were counted for positive colocalizations. We used GWAS summary statistics of creatinine-based and cystatin C-based eGFR (eGFRcrea and eGFRcys) from Stanzick et al.83, who meta-analyzed kidney function GWAS from the CKDGen Consortium and the UK Biobank. The GWAS summaries were downloaded from the CKDGen data website at ckdgen.imbi.uni-freiburg.de. Colocalization testing between mQTL and trans pQTL was performed within a window of ±500 kb around the mQTL’s index SNP when at least one trans pQTL association with P < 0.05 ÷ 409 ÷ 3,000 for plasma and P < 0.05 ÷ 410 ÷ 3,000 for urine was present within a window of ±100 kb around the index SNP. Similarly, colocalization analysis between mQTL and biomarkers, diseases and kidney function traits was performed within ±500 kb of the index SNP when there were one or more associated variants with MAF > 0.01 and P < 0.05 ÷ 409 or P < 0.05 ÷ 410, respectively, within ±100 kb of the index SNP, using the method outlined above.

Moreover, we investigated whether the most likely mQTL-related genes contained rare, putatively damaging variants that in aggregate are associated with clinical traits and diseases. Gene–phenotype associations based on whole-exome-sequencing data from ~450,000 UK Biobank participants were obtained on 4 February 2022 from the AstraZeneca PheWAS Portal (azphewas.com/) for the 274 available genes of the 282 mQTL-related genes22. We identified 2,745 distinct suggestive (P < 1 × 10−5) gene–phenotype associations for 115 of those genes. The significance threshold as derived for the PheWAS was 2 × 10−9 (ref. 22). Only the most significant collapsing model per trait was retained for Fig. 4b. In addition, the exome-wide variant-level results were downloaded on 26 August 2022. The 17,493 analyzed phenotypes were queried for significant (P value < 0.05 ÷ 17,493) associations with mQTL index SNPs (Supplementary Table 16).

We further performed colocalization testing of independent signals for all the 12 identified mQTLs within the DPEP1 genomic region and plasma proteins with a reported pQTL in the DPEP1 locus46. Metabolite and plasma protein summary statistics were extracted with a 500-kb flanking region around DPEP1 and the DPEP1 mQTL index SNP for the proteins CPB1, AMY2B, PNLIP, AMY2A, REG3G, CTRB2 and PNLIPRP1. First, independent association signals were identified based on approximate conditional analyses via the GCTA-COJO Slct algorithm (P value < 5 × 10−8; collinearity threshold = 0.1)73. For each conditionally independent SNP, conditional summary statistics were computed by conditioning on all other independent SNPs in the region using the GCTA-COJO Cond algorithm (collinearity threshold = 0.1)73. Subsequently, colocalization analyses were conducted for all pairwise combinations of the conditionally independent mQTL and pQTL associations as outlined above. For the gallstone disease GWAS84 and urine glycocholate, we performed colocalization analysis of signals conditioning on the plasma index SNP (rs55971546) within ±500 kb of the SLC10A2 urine mQTL index SNP (rs16961281). The same conditional mQTL summary statistics were colocalized with kidney eQTL38. Marginal statistics were used for these, as rs55971546 was not available (FDR > 0.01).

Processing of gene expression data from tissue and cell types

To test for over-representation of plasma or urine mQTL-related genes among those highly expressed in specific tissues and cell types, we compiled bulk and single-cell gene expression (RNA-seq) datasets. These included GTEx version 8 (ref. 78), the Human Liver Cell Atlas85, a single-cell dataset and a single-nucleus dataset from the human kidney86,87, a single-cell dataset from the mouse kidney88, a single-cell dataset from the human intestine89 and a single-nucleus dataset from the kidneys of patients with CKD from the Kidney Precision Medicine Project (KPMP)90. Except for the KPMP, data sources and processing followed the workflow published by Cheng et al.91. KPMP data were downloaded from the KPMP Kidney Tissue Atlas repository at atlas.kpmp.org/repository as Seurat-format files and were subsequently processed in Seurat92 similar to the other datasets. For generation of the top 10% highly expressed genes for each tissue and cell type in each dataset, we followed the workflow published by Schlosser et al.5.

GO, KEGG, tissue and cell type enrichment analyses

Enrichment testing of the 282 identified genes was performed as follows. The number of independent SNPs per gene was computed using GCKD genotypes (PLINK version 1.90 (ref. 93)), and a database of Entrez gene identifiers based on org.Hs.eg.db version 3.8.2 was generated. Gene annotation included the number of independent SNPs per gene, gene length, GO terms94 and KEGG pathways95, as well as being Human Protein Atlas tissue or group enriched96; Human Protein Atlas cell type enhanced, enriched or group enriched97; being a VIP gene from PharmGKB (accessed 5 December 2020)98; being a gene with an actionable drug interaction from the Clinical Pharmacogenetics Implementation Consortium (levels A, A/B and B; accessed 13 January 2021)99; and being among the top 10% highly expressed genes in each GTEx version 8 tissue78 and human85,86,87,89,90 and murine cell types88. We performed 100 million random draws of an equal number of genes as contained in the respective source list (combined mQTLs, 282; plasma mQTLs, 214; urine mQTLs, 195; plasma-only mQTLs, 87; urine-only mQTLs, 68), matched for deciles of the number of independent SNPs and deciles of gene length and compared any overlap with cell types, tissues and terms with the ones identified for the original source list. Multiple-testing correction was performed using the Benjamini–Hochberg procedure100.

Lastly, we tested for over-representation of certain phenotypes among mice in which the implicated genes had been genetically manipulated. The phenotype terms ‘abnormal homeostasis’ (MP:0001764) and ‘abnormal metabolism’ (MP:0005266), all of their child terms and all genes associated with these terms were downloaded from MouseMine101 on 9 December 2021. Mouse genes were mapped to their human homologs using the getLDS function from the biomaRt package102. Human and mouse genes that did not map to a homolog in the respective other species were excluded from the analysis. This excluded 861 of 6,051 abnormal homeostasis genes, 61 of 952 abnormal metabolism genes and ten of 282 mGWAS genes (PYCRL, GBA3, PPDPFL, CETP, NAT16, ZNF680, ENOSF1, ACSM6, FUT3, ZNF675). The genes identified from urine, plasma or both were tested for over-representation among the genes belonging to each of the phenotype terms using Fisher’s exact test (with the universe set to 13,151 genes, the number of mouse genes that mapped to human homologs in the Mouse Genome Informatics database), followed by Benjamini–Hochberg correction for multiple testing.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Read more here: Source link