Introduction
where genetic predisposition is shared across thousands of mostly common variants with individually modest effects on population risk. Since 2005,
GWASs have successfully identified thousands of genomic regions significantly associated with common diseases, with notable successes in type 2 diabetes (T2D)
and coronary artery disease.
This approach was successfully applied at the start of the coronavirus disease (COVID) global pandemic in 2020, with newly established international collaborations driving COVID-19 GWASs and making all data publicly available.
GWAS datasets are increasingly publicly shared, and these datasets are widely used to further basic research, as well as translation, including in drug-discovery pipelines.
Recent publications have included GWASs of ∼4,000 brain-imaging traits,
∼1,500 protein biomarkers,
and 778 traits in the UK Biobank (UKBB).
Dense imputation panels have increased the number of variants analyzed, with a typical GWAS now including more than 8 million variants. GWAS analytical methods are also beginning to be applied to whole-genome sequencing data,
with the potential for vastly increased coverage of the genome and inclusion of rare variants.
The GWAS Catalog was established with the aim of providing a central repository for variant-trait associations identified through GWASs, serving as a starting point for investigations to identify causal variants, understand disease mechanisms, and establish targets for new therapies. GWAS datasets are both submitted by the research community and identified via the peer-reviewed literature; they are then curated and annotated according to transparent standards by GWAS Catalog curators and made available via a user-friendly web-based search interface. As of June 2021, the Catalog contains more than 5,000 publications covering more than 20,000 individual GWASs, with more than 250,000 top associations (p < 1 × 10⁻⁵). Downloadable flat files and a representational state transfer application programming interface (REST API) provide flexible access to the data. Data from the GWAS Catalog are openly shared and re-usable, which has enabled integration into numerous other reference databases, such as Ensembl and the Open Targets resources.
and also via a dedicated API. All data are made available freely and without restriction or registration requirements, in contrast to the controlled access offered by related resources, such as the Database of Genotypes and Phenotypes (dbGaP).
The reasons given for limited sharing include technical challenges, concerns regarding data misuse, privacy concerns,
and the perceived lack of an appropriate repository. In addition, for those that do share GWAS SumStats, they are often not submitted to a centralized repository and are instead made available only on dispersed project-centric websites, presented in a range of different formats, and largely lacking rich, searchable metadata. The lack of a centralized repository and global standards for data content and format presents challenges for users who must find, harmonize, organize, and manage the data before analyses.
1. Ensuring SumStats and study metadata are findable, accessible, interoperable, and reusable (FAIR) and relevant to the user community
2. Establishing a community standard for reporting GWAS SumStats and metadata
3. Identifying strategies to incentivize sharing of SumStats
Here, we review the GWAS SumStats standards and sharing workshop proceedings and community discussions. We report our recommendations and planned implementations to realize the broad sharing of GWAS SumStats and to ensure their FAIRness. Our recommendations include timely deposition of datasets in the GWAS Catalog, as the recognized central repository, and standards for reporting elements and formats.
Workshop proceedings
Data content
for SumStats, which depends on use case and stakeholder needs (see Box 1, Workshop recommendations 8, working group on “Data content and format”).
GWAS SumStats and accompanying metadata should be versioned to enable users to identify the most recent dataset. The GWAS Catalog will develop a data update and versioning strategy to meet those needs. Linking from GWAS SumStats and metadata to relevant datasets in other databases (e.g., dbGaP, EGA, BioData Catalyst, and AnVIL) should be improved. The GWAS Catalog will develop improved cross-linking to relevant databases.
To further assess and finalize metadata content, variant identification, and file format requirements, including for association testing with multiple variants in a region, we are convening a working group (“Data content and format” working group).
effect allele, other allele, effect size (odds ratio or beta), and standard error (Table 1; Box 1, Workshop recommendations 5). Alternative ways of representing variants were discussed because it is recognized that using reference SNP IDs (rsIDs) or genomic location does not facilitate unambiguous identification of all variants. Attendees also suggested that variant representation should be compliant with Global Alliance for Genomics and Health (GA4GH) standards (vrs.ga4gh.org/en/stable/) and be able to represent haplotypes. It was also noted that the standard should specify the level of detail required for each value, for example, the number of significant digits.
Table 1. Recommended standard reporting elements for GWAS SumStats
Data elements have been recommended as mandatory if >50% of pre-workshop survey respondents indicated that preference.
(a) We agreed that other variant ID formats should be supported. Implementation of those standards will be addressed by the working group “Data Content and Format.”
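As an illustration of how the reporting elements named above (effect allele, other allele, effect size as odds ratio or beta, and standard error) might be checked before submission, here is a minimal Python sketch. The column names are hypothetical, because the final headers are still to be set by the working group, and the checks shown are assumptions rather than the GWAS Catalog's validation rules.

```python
import math

# Hypothetical column names; the final standard headers are not yet defined.
REQUIRED_FIELDS = ("effect_allele", "other_allele", "standard_error")
EFFECT_FIELDS = ("beta", "odds_ratio")


def validate_row(row):
    """Return a list of problems found in one SumStats row (a dict of strings)."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if not row.get(name)]
    effects = [name for name in EFFECT_FIELDS if row.get(name)]
    if not effects:
        problems.append("missing effect size (beta or odds_ratio)")
    for name in ["standard_error", *effects]:
        value = row.get(name)
        if not value:
            continue
        try:
            number = float(value)
        except ValueError:
            problems.append(f"{name} is not numeric")
            continue
        if not math.isfinite(number):
            problems.append(f"{name} is not finite")
        elif name in ("standard_error", "odds_ratio") and number <= 0:
            problems.append(f"{name} must be positive")
    return problems


def effect_as_beta(row):
    """Return the effect size on the log-odds scale: beta, or ln(odds ratio)."""
    if row.get("beta"):
        return float(row["beta"])
    return math.log(float(row["odds_ratio"]))
```

A row that reports an odds ratio passes validation and can be put on the same scale as rows that report beta, which is the kind of harmonization a shared standard is meant to simplify.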
Although the sharing of SumStats poses low risk to participants’ privacy, there could be a small risk of identifying individual-level data, and those risks are greater for certain studies, such as those that include individuals from isolated populations or with rare traits. We agreed that it was important to acknowledge the potential for risk by specific study criteria and to provide guidance on how to minimize risk. It was suggested that the requirements for data sharing could be different for studies whose datasets have been determined to be sensitive: for example, the risk of identification could be reduced by not requiring the sharing of study-specific minor allele frequencies (MAFs) or by reducing the number of decimal places required for p values.
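The two mitigations suggested above could be applied to a single row roughly as follows. This is a sketch under stated assumptions: the column names `p_value` and `minor_allele_frequency` are hypothetical, and reduced precision is interpreted here as rounding to a fixed number of significant digits, which the workshop did not specify.

```python
def round_sig(value, digits):
    """Round a float to the given number of significant digits (digits >= 1)."""
    return float(f"{value:.{digits - 1}e}")


def reduce_identifiability(row, p_digits=2, drop=("minor_allele_frequency",)):
    """Return a copy of a SumStats row with less identifying detail.

    Columns listed in `drop` are omitted, and the p value (hypothetical
    column name `p_value`) is rounded to `p_digits` significant digits.
    """
    reduced = {key: value for key, value in row.items() if key not in drop}
    if "p_value" in reduced:
        reduced["p_value"] = round_sig(float(reduced["p_value"]), p_digits)
    return reduced
```

Rounding to significant digits rather than fixed decimal places keeps very small p values informative, since fixed decimal places would round most genome-wide significant results to zero.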
FAIR
taking into account the needs of users (Table 2). We discussed which of those indicators are already being met and where improvements are required.
Table 2. FAIR indicators
Our recommended FAIR indicators for GWAS SumStats. We list each core FAIR principle and the associated indicators and provide examples of how they are implemented in the GWAS Catalog.
We agreed that unique and persistent accession IDs must be provided for SumStats at the point of submission of the dataset to a database and prior to publication of the study in a journal. This allows journals to check that the dataset is accessible and for the inclusion of accession IDs in the publication. For the reporting of SumStats, most attendees agreed that the following metadata elements should be mandatory: sample size (including number of cases/controls), sample ancestry, imputation method and reference panel, covariates, trait measurement (e.g., self-reported versus clinically diagnosed), sample inclusions/exclusions, additional cohort descriptors (e.g., cohort names), analysis plan (e.g., model and software used), genotyping/sequencing technology, minor allele frequency cutoff, trait quality control, and number of variants analyzed. Attendees discussed that there is an incentive to meet only the minimum requirements; therefore, those requirements should include all useful information; otherwise, those data may not be shared. There were differing opinions on the preferred metadata format, either incorporated within the SumStats file or in a separate file, and that format needs further discussion. However, it was agreed that representing metadata using a standard file can be challenging, and tools to support users in that would be extremely beneficial.
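Because attendees noted that tools to help submitters prepare metadata would be beneficial, the sketch below shows one way such a tool might flag missing mandatory elements. The key names are hypothetical labels for the elements listed above. The metadata format itself (embedded in the SumStats file or held separately) remains undecided, so a plain Python dictionary is assumed.

```python
# Hypothetical keys for the metadata elements most attendees considered mandatory.
MANDATORY_METADATA = (
    "sample_size",
    "sample_ancestry",
    "imputation_method",
    "imputation_reference_panel",
    "covariates",
    "trait_measurement",
    "sample_inclusions_exclusions",
    "cohort_descriptors",
    "analysis_plan",
    "genotyping_technology",
    "maf_cutoff",
    "trait_quality_control",
    "variants_analyzed",
)


def missing_metadata(metadata):
    """Return the mandatory keys that are absent, None, or an empty string."""
    return [key for key in MANDATORY_METADATA if metadata.get(key) in (None, "")]
```

An empty list for a field such as `covariates` is treated as present here, on the assumption that a study may legitimately report that no covariates were used.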
Incentivization of sharing
The aim of this session, chaired by Orli Bahcall, was to identify barriers to sharing of GWAS data and define strategies to overcome those barriers, including identifying incentives for data sharing. From her experience in working on the development of data-sharing programs and with a broad range of GWAS producers, she proposed that the barriers to sharing and the strategies required to overcome them differ between GWAS producers who want to share their datasets but meet challenges and those who are reluctant to share from the outset.
Most of the challenges faced by GWAS producers who are amenable to data sharing can be reduced or eliminated by the presence of a suitable repository that supports a submitter’s needs: ease of submission, short waiting times, clear requirements, provisioning of an accession identifier at time of submission, support for versioning, ability to submit the dataset early (soon after generation and before posting a first preprint manuscript reporting the dataset), optional access control, and setting embargoes.
In relation to privacy and de-identification issues, however, those barriers have been addressed by the 2018 NIH statement (see web resources). Compounding that issue are concerns over consent and regulatory requirements; there may be a lack of either transparency or clarity on whether participants’ consent agreements allow for sharing of SumStats. Second, even though genomics has paved the way in data sharing for biological sciences and is the most progressive community in commitment to open science, a widespread culture of data ownership continues. These data producers maintain private or restricted ownership of their data in the interest of competitive advantage for publications and other research outputs.
Infrastructure requirements
Infrastructure for data management and storage is essential to enable sharing of GWAS data and to support data deposition, hosting, and distribution. In this session, chaired by Mike Inouye, we aimed to evaluate stakeholder infrastructure and data hosting requirements.
and MR-Base).
In this model, the repository serves as the intermediary, supporting submission by data generators and access by data users. This presents the question of where the burden of formatting data should be placed. Most participants felt that this burden should lie with submitters, who can facilitate the validation of submitted data to support harmonization and downstream uses. To mitigate that burden, formatting and validation tools for submitters are essential, along with support for submission of large volumes of data, accession IDs being provided upon submission, versioning support, and protocols that are free to use. For users of SumStats, the most important requirement is access to harmonized data. This should be supported by flexibility in access methods, including filtering across SumStats, robust APIs, and dataset download.
Data update cycle
In this session, chaired by Raymond Walters, we discussed the requirements for the GWAS SumStats data update cycle, including when datasets should be submitted and how to handle updates and versioning.
Forward look
In this session, chaired by Stephen Rich, we considered requirements for alternative GWAS design and emerging technologies.
of GWAS Catalog data from 167 publications, we found that the reporting of aggregate association results is extremely variable, with minimal information included as standard in shared SumStats, often only locus ID and p value. Workshop attendees agreed that there is a need to standardize the reporting of those aggregate-association tests, both in how the tests are performed and also for the results, including the set of variants that contribute to each test. We agreed that standard reporting guidelines need to be defined for SumStats and metadata from GWAS testing association with multiple variants in a region. We will review further to establish a definitive list of required elements and standard format, as part of the “Data content and format” working group (Box 1, Workshop recommendation 8).
Acknowledgments
We thank workshop participants for their engagement and contributions, members of the community for completing the pre-workshop survey, and Raymond Walters for chairing the session on “Data update cycle.” Research reported in this publication was supported by the National Human Genome Research Institute of the National Institutes of Health under award no. U41HG007823 and EMBL-EBI Core Funds. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. In addition, we acknowledge funding from the European Molecular Biology Laboratory. Workshop chairs acknowledge the following funding: I.B., “Expanding excellence in England” award from Research England; S.S.R., NIH award R01 HL105756-08; R.J.C., NHGRI award U24HG010262.
Declaration of interests
J.A.L.M.’s immediate family member is an employee and shareholder of Illumina, Inc. P.L.W. is employed by an SME with an interest in GWAS, but the work described in this publication predates this employment.