The IPDGC/GP2 Hackathon – an open science event for training in data science, genomics, and collaboration using Parkinson’s disease data

GWAS-level and post-GWAS analyses

GWAS of PD have nominated 90 independent risk signals in individuals of European ancestry, explaining ~16–36% of the heritable risk7, as well as two additional risk signals in Asian populations8. Typically, published GWAS are accompanied by various follow-up analyses, but performing these analyses is not always straightforward. Common post-GWAS analyses include heritability estimation and polygenic risk score (PRS) calculation. Heritability analyses estimate the percentage of disease risk accounted for by common genetic variants, and PRS can be used to predict disease risk by aggregating the effects at multiple common risk loci9. Another follow-up approach to help interpret GWAS results is to combine Single Nucleotide Polymorphisms (SNPs) into a group of functionally related genes, such as genes belonging to a single biological pathway or cell type. This is called gene set enrichment analysis and is a widely used approach to examine the cumulative effect of SNPs in a particular biological process and determine whether there are particular pathways, processes, or cell types affected in disease.

Project 1 (post-GWAS analysis)

We used AMP PD version 1 release data to develop a Terra-based pipeline for assessing SNP-based heritability, as well as polygenic risk score calculation. Using the GREML-LDMS method10 applied to data from the AMP PD version 1 release, we estimated narrow-sense heritability (h2) to be roughly 52%, which is much higher than the typical estimate of 22%7, and is likely biased due to the recruitment of specific variant carriers present in the AMP PD cohort. AMP PD provides information regarding the recruitment arm. Researchers should investigate this information to determine if specific samples need to be removed from certain analyses in future work. We then used PLINK v1.911 and estimated risk effect sizes from the summary statistics of the latest PD GWAS7 to calculate the genetic risk scores of PD LRRK2 mutation carriers (n = 382), control LRRK2 mutation carriers (n = 275), and control individuals without PD causing mutations (n = 3435) from the AMP PD version 2.5 dataset. We tested normalized z-scores for association with LRRK2 carrier disease status. Mean ± standard deviation of unadjusted PRS scores were higher in PD LRRK2 mutation carriers (−0.0166 ± 0.004), compared to control LRRK2 mutation carriers (−0.0182 ± 0.004) and controls (−0.0180 ± 0.004), suggesting that PD LRRK2 mutation carriers share a common polygenic risk profile with idiopathic PD (iPD), contrary to control LRRK2 mutation carriers (Fig. 2a).

Fig. 2: Results from the GWAS-level and post-GWAS analyses projects.
figure 2

a Violin plots comparing z-transformed Parkinson’s disease (PD) genetic risk score distributions in PD-LRRK2 cases, non-PD-LRRK2 carriers, and controls. Within the violins, box plots display the median and the bounds of the box correspond to the 25th and 75th percentiles. The upper and lower limits of the whiskers correspond to 1.5 times the limits of the 25th and 75th percentiles. PD-LRRK2 individuals had a higher risk of developing PD compared to control LRRK2 mutation carriers (OR = 1.60, 95% CI = 1.33–1.93, P = 1.1 × 10–6). The mean of the unadjusted GRS score was also significantly higher in PD-LRRK2 cases compared to non-PD-LRRK2 carriers (P = 2.9 × 10–6) and controls (P = 5.1 × 10–7) in the pairwise Wilcoxon rank-sum test. b Summary of input genes from WebGestalt showing the number of PD genes (from GWAS significant SNPs) which overlap with the annotated genes in the Gene Ontology Slim terms from the biological process, cellular component, and molecular function.

Project 2 (pathway and cell enrichment pipeline)

We aimed to create a pipeline to annotate GWAS summary statistics to test the enrichment of biological pathways and cell types. Analyses were performed on the Terra platform using the most recent PD GWAS summary statistics7. We created a pipeline to correctly format the GWAS summary statistics for a common annotation tool Functional Mapping and Annotation of Genome-Wide Association Studies (FUMA)12, then downloaded the formatted data and uploaded it to FUMA. We also ran the WEB-based GEne SeT AnaLysis Toolkit (WebGestalt) directly on Terra using the WebGestaltR package13. We selected the nearest genes to the GWAS significant SNPs (P < 5 × 10−8) from Nalls et al. 2019 GWAS summary statistics. Using WebGestaltR, we conducted an overrepresentation analysis and gene set enrichment analysis. We identified 97 unique genes from the genome-wide significant hits in the PD GWAS summary statistics. We generated summary data for these PD genes annotated by biological processes, cellular components, and molecular functions (Fig. 2b). There was no significant enrichment of any Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway gene sets in the overrepresentation analysis (FDR P < 0.05).

Downstream analyses of genetic variation

While GWAS have identified many common variants associated with complex diseases like PD, it is follow-up analyses that have started to decode GWAS results, and more downstream analysis is needed to unravel the implications of observed genetic variation in PD. Three types of analysis were the focus for the downstream analyses of the genetic variation topics, including colocalization, variant interaction, and network generation and visualization. Colocalization analysis allows the calculation and estimation of the correlation between a GWAS locus and an expression quantitative trait locus (eQTL). Variant interaction, or epistasis, is an interaction of genetic variation at two or more loci to produce a phenotypic outcome that is not predicted by the additive combination of effects attributable to the individual loci14. Its importance in humans continues to be a matter of debate15,16, but it may explain some of the “missing heritability” underlying complex diseases such as PD16,17,18. In addition to investigating individual variant effects with colocalization and epistasis, visualizing biological networks can help with understanding complex molecular relationships and interactions. In PD research, genetic and gene expression data has been used in community network analysis to nominate pathways and genes for drug target and functional prioritization19,20.

Project 3 (colocalization pipeline)

Colocalization analysis takes into account five hypotheses: H0 (no association between the locus and either trait), H1 (locus has an association with first trait only), H2 (locus has an association with second trait only), H3 (locus has an association with both traits but driven by different SNPs which are not in linkage disequilibrium (LD)), H4 (locus has an association with both traits driven by same SNPs). For Project 3: Colocalization pipeline, we considered colocalization analysis with a posterior probability of colocalization in H4 (PPH4) greater than 0.8 to be significant. We utilized the coloc R package21,22 and summary statistics from ref. 7. We used eQTL data from a cerebellar cortical meta-analysis of four cohorts23, publicly available from the AMP-AD Knowledge Portal24. As an example for our pipeline, we extracted the region ± 500 kb around DYRK1A, nominated in Nalls et al. 2019, from the GWAS summary statistics and eQTL data. To visualize the results, we employed the eQTpLot R package25, which can generate different plots for GWAS and eQTL signal colocalization, as well as the correlation between their p values and enrichment of eQTLs among variants and LD of loci of interest, allowing efficient and intuitive visualization of gene expression and trait interaction. We used our previously generated results for DYRK1A and whole brain eQTL as an example for creating visualizations using this package. (Fig. 3a).

Fig. 3: Results from the downstream analysis of genetic variation projects.
figure 3

a Displays the locus of interest, in this case, ±500 kb from DYRK1A, and the horizontal line depicts the GWAS significance threshold of P = 5 × 10–8. Displays the genes in the locus of interest. b Depicts the Leiden gene networks and correlations for significant eQTLs for PD controls and PD cases. c Depicts the general workflow for the variant interaction pipeline.

Project 4 (network generations and visualization pipeline)

We sought to develop a Leiden network and subsequent visualization pipeline for transcriptomic and genomic data to identify and visualize both a priori and complex phenotype gene regulatory networks. The Leiden algorithm is one option for community detection of networks and can be faster and return more reliable results than the more well-known Louvain algorithm26. We relied on the leidenalg27 package in Python to produce weighted and unweighted networks on GWAS summary statistics and then visualized the resulting networks. (Fig. 3b). Data used consisted of AMP PD genomic, transcriptomic data, and public eQTL data from the eQTL catalog28 and PD summary statistics from the most recent PD GWAS7. This project was designed as a proof-of-concept for a pipeline for detecting gene networks and relating them to PD phenotype information via GWAS summary stats.

Project 5 (variant interaction pipeline)

We developed a workflow that can be summarized as follows: (1) We utilized individual-level test data in binary format to perform data harmonization with PLINK v1.911 to ensure that the risk allele was consistent for all the variants; (2) We established a minor allele frequency (MAF) threshold >0.05 to subset variants, keeping only common genetic variation; (3) We annotated variants of interest using ANNOVAR29, differentiating between coding and non-coding as well as annotated predicted gene consequence; (4) We carried out interaction analyses in R 3.6 using the glm() function and adjusting for age, gender, and the first five components. (Fig. 3c).

Data visualization

Visualization of clinical and genetic data plays an essential role in research. It can be used to inform the progress of initiatives like GP2, help researchers to view data in a meaningful way, and generate and corroborate hypotheses. As GWAS and other analyses nominate more PD risk loci, efforts to decode the role of these variants and how they interact with both longitudinal and cross-sectional phenotypes will be needed. Four projects focused on data visualization, including a GP2 cohort tracker, updates to the IPDGC locus browser, visualization of longitudinal clinical phenotypes, and visualization of longitudinal and cross-sectional variant effects.

Project 6 (GP2 cohort tracker visualization)

We designed the GP2 cohort tracker visualization to show essential information about cohorts recruited for GP2 and showcase their diversity, geographical location of enrollment, ancestry representation, and additional relevant metadata. We designed this visualization to inform progress and inspire others to contribute to this initiative. In the form of a one-page dashboard developed with the open-source Python software Streamlit, the visualization includes separate maps for complex and monogenic cohorts. It was critical to include easy-to-use search and discovery aspects built into the dashboard. If a user knows the name of a particular cohort, then they can pull up information for that cohort that populates the rest of the dashboard. The user can also filter by general methods such as cohort size or country. This design is used internally and externally on the GP2 website (gp2.org/cohort-dashboard/) to inform those interested in the progress of GP2’s cohort integration (Fig. 4a).

Fig. 4: Results from the data visualization projects.
figure 4

a The left banner allows for filtering and specific cohort selection, the map depicts cohort origin, and the right panel depicts PD vs. non-PD distribution. b (A) Locus zoom plot generated using conditional analysis statistics for locus 78 PD risk variant rs2248244. (B) Violin plot of GBA expression in AMP PD cases and controls. (C) Example plot of browser user visits over time. c Depicts an example image from the app, in this case, a scatter plot visualization of UPDRS2 scores across visits, color-coded for sex. d On the left, users can input their query parameters, including variant, biomarker(s), and cohort(s) of interest. On the right, a forest plot demonstrates the regression beta for the variant of interest in cross-sectional data. A bar plot demonstrates the number of participants in each cohort. The exact visualization is also available for longitudinal data, with all available data available in a tabular format in the “Data Table” tab.

Project 7 (IPDGC GWAS loci browser expansions)

To facilitate investigations of nominated risk variants, members of IPDGC have created a PD GWAS locus browser (pdgenetics.shinyapps.io/GWASBrowser/) that makes relevant statistics and datasets available to the public30. Throughout the hackathon, our team continued the development of this browser through the addition of new datasets and features. To identify secondary association signals at each locus from the Nalls et al. 2019 study, we performed conditional analysis using the Genome-wide Complex Trait Analysis (GCTA) tool31,32. Locus zoom plots were added to display the results of this conditional analysis (Fig. 4b)33. Power calculations were done for each risk variant by Nalls et al. 2019 to determine if the findings were sufficiently powered. To do so, we followed methods used by the Genetic Association Study Power Calculator tool (csg.sph.umich.edu/abecasis/gas_power_calculator/), using summary statistics from Nalls et al. 2019, a disease prevalence of 0.01, and a significance level of 0.05 as input. We queried blood gene expression data included in the AMP PD version 2.5 release to measure expression levels in PD cases and controls. We obtained TPM expression at baseline for samples that had case or control status and no PD mutations in whole-genome sequencing data, leaving a total of 1710 samples. Expression data for each gene was displayed in a violin plot and added to the expression section of the browser (Fig. 4b). The literature section of the browser was updated to display a description, PubMed hit count, and word cloud plot for each gene within 1 MB of a PD risk variant. Our last addition to the browser was a display of user statistics. We used the googleAnalyticsR package34 to record and visualize the number of visits for the browser and each risk variant within a period specified by the user (Fig. 4b).

Project 8 (visualization of longitudinal UPDRS/HY scores)

The Unified Parkinson’s Disease Rating Scale (UPDRS) and Hoehn-Yahr (HY) stage are two of the most common measures of the severity of PD. We set out to develop a user-friendly and adaptable app to display a diverse set of visualizations of longitudinal UPDRS/HY scores, based on data from GP2, utilizing the Streamlit library from Python. During the Hackathon, we successfully integrated data from three cohorts: Parkinson’s Progression Markers Initiative (PPMI), Parkinson’s Disease Biomarkers Program (PDBP), and BioFIND35,36,37. We were also able to produce four different visualizations (Fig. 4c). First, we created bar graphs to visualize changes in scores from the data over time, and we added the option to include baseline patients only. Second, we created line graphs showing confidence intervals for longitudinal changes in HY and UPDRS scores. Our final visualizations were experimental, but we produced proof-of-concept visualizations with limited options. We created a Sankey graph visualization that better visualized how participants moved between different subsets of the population over time. Lastly, the fourth visualization is a Kaplan–Meier curve showing the time to reach a certain threshold within our progression scores.

Project 9 (Visualize longitudinal and cross-sectional variant effects)

We set out with the aim of creating an interactive and user-friendly web application that would allow users to (i) visualize the effect of a genetic variant across multiple cohorts using publicly available GWAS summary statistics and (ii) input their GWAS summary statistics for visualization and meta-analysis with existing data. As test data, we used a small subset of results from a study of amyloid-β levels in cerebrospinal fluid derived from healthy control individuals and individuals with PD from the PPMI dataset35. Amyloid-β levels were measured at baseline and in follow-up visits; thus, results were available from both a cross-sectional and longitudinal GWAS. Using the R shiny38 framework, we produced a skeleton framework for our web application, with several tabs, including (i) an “Upload data” tab where users could upload and query their data and (ii) a “Plots” tab where users could query the available test data and visualize it. In the “Plots” tab, we allowed users to query by a variant of interest, with the option to choose which biomarker(s) and cohort(s) they wished to visualize. The variant beta was visualized across cohorts and biomarkers using a forest plot, while the number of participants/observations was visualized using a bar plot (Fig. 4d). Tabs were available for cross-sectional and longitudinal plots, with the option to download the plots, and finally, data was also made available in a tabular format.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Read more here: Source link