Wavelet Screening: a novel approach to analyzing GWAS data | BMC Bioinformatics

Haar wavelet transform

Our method transforms the raw genotype data in a manner similar to the widely used ‘Gene- or Region-Based Aggregation Tests of Multiple Variants’ method [15] (Fig. 1). As in the Burden test, the effects of the genetic variants in a given region are summed to construct a genetic score for the regression. The first step in our analysis is the application of the FDWT to the multi-SNP data. In the next subsection, we introduce the Haar wavelet transform and show how the wavelet coefficients are computed. Readers unfamiliar with wavelets are referred to the comprehensive introduction by Nason [16]. In the rest of this article, ‘wavelet’ refers specifically to the Haar wavelet.

Fig. 1

Genetic variation in one individual within a locus spanning two million base pairs (2 Mb), including 10,000 imputed SNPs

We code a SNP as 0 if an individual is homozygous for the reference allele (usually the more frequent or ‘major’ allele), as 1 if heterozygous, and as 2 if homozygous for the alternative allele (the less frequent or ‘minor’ allele). This is the standard coding of alleles in an additive genetic model [3]. Let (G_{0,k}(bp)) denote the ‘true’ genetic signal of individual k at physical position bp (base pair), and let (G_{k}(bp)) be the observed, imputed version of (G_{0,k}(bp)). We assume that

$$\begin{aligned} G_{k}(bp)&= G_{0,k}(bp) + \epsilon_{k}(bp) \end{aligned}$$


where (epsilon _{k}(bp)) are independently and identically distributed (iid) over individuals, with (mathrm {Var}(epsilon _{k}(bp))=sigma ^2(bp)). The variance (sigma ^2(bp)) at position bp can be interpreted as a function of the imputation quality IQ(bp), which takes values in (left[ 0,1right]); a value of 1 represents a perfectly imputed or directly genotyped SNP, hence (sigma ^2(bp) propto 1-IQ(bp)). As the data used here were already quality-controlled, only SNPs with (IQin left[ 0.8,1right]) were retained for further analyses. We assume that the imputation metrics are independent and heteroscedastic over bp. Since a SNP value lies in ({0,1,2}) before imputation and in (left[ 0,2right]) under the dosage convention after imputation [3], the distribution of (epsilon _{k}(bp)) is not straightforward. However, as our model is calibrated by simulations, this error distribution does not have to be specified.
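As an illustration, the dosage model above can be sketched as follows. The Gaussian error law, the clipping to [0, 2], and all numeric values are assumptions for demonstration only, since the model leaves the error distribution unspecified:

```python
import numpy as np

rng = np.random.default_rng(0)
n_ind, n_snp = 5, 8

# 'True' genotypes G0 in {0, 1, 2}: minor-allele counts under additive coding.
g0 = rng.integers(0, 3, size=(n_ind, n_snp))

# Per-SNP imputation quality IQ in [0.8, 1], as retained after quality control.
iq = rng.uniform(0.8, 1.0, size=n_snp)

# Observed dosages G = G0 + eps with Var(eps) proportional to 1 - IQ.
# A Gaussian error is an assumption here: the model does not specify it.
eps = rng.normal(0.0, np.sqrt(1.0 - iq), size=(n_ind, n_snp))
g = np.clip(g0 + eps, 0.0, 2.0)  # dosages lie in [0, 2] by convention
```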

We define a genetic region (GR_{lb,ub}) (GR, genetic region; lb, lower bound; ub, upper bound) on a given chromosome as the set of physical positions bp in the interval (lb< bp < ub). In the rest of the paper, we assume the analyses are performed within a fixed genetic region (GR_{lb,ub}) on a given chromosome. We observe the value of (G_{k}(bp)) at pre-determined and increasing positions (bp_1, ldots, bp_n) within the interval ((lb, ub)), with some error due to the genome-wide imputation process [17]. For now, we assume that we have (n = 2^J) equally spaced observations within (GR_{lb,ub}) and denote the observed value of (G_{k}(bp_i)) by (g_k(bp_i)); i.e., the data value measured on individual k at position (bp_i), (i = 1, ldots, n). We define the wavelet d and c coefficients as sequences of length (2^J). These coefficients are computed by Mallat’s pyramid algorithm [11].

For the coefficients at the highest scale (i.e., scale (J-1)), for (i in {1,ldots ,2^{J-1}}),

$$\begin{aligned}&d_{J-1,i}= g_{k}(bp_{2i}) - g_{k}(bp_{2i-1})\\&c_{J-1,i}= g_{k}(bp_{2i}) + g_{k}(bp_{2i-1}). \end{aligned}$$

These coefficients correspond to local differences (or sums) of the measured values. For lower scales, the coefficients are computed as follows:

$$\begin{aligned}&d_{j-2,i}= c_{j-1,2i} - c_{j-1,2i-1}\\&c_{j-2,i}= c_{j-1,2i} + c_{j-1,2i-1}. \end{aligned}$$

Finally, the coefficients at the lowest scale (i.e., scale 0) are computed as:

$$\begin{aligned} d_{0,1}=c_{0,1}= \sum_{i=1}^{2^J} g_{k}(bp_i). \end{aligned}$$

These procedures are often written as square matrices (W_d) and (W_c) (for the d and c procedures, respectively) of size (2^J), where the rows of (W_d) and (W_c) are normalized. We then have (d = W_d g_{k}) and (c = W_c g_{k}). In addition, because the matrix (W_d) is orthogonal, we have:

$$\begin{aligned} ||d||^2 = \left( W_d g_{k}\right)^t W_d g_{k} = ||g_{k}||^2. \end{aligned}$$

Using the (2^J) wavelet coefficients for individual k, all values (g_{k}(bp)) in the genetic region (GR_{lb,ub}) can be completely recovered. However, this wavelet transformation assumes that the data are evenly spaced and that there are (n = 2^J) measurements, which may not be realistic in practice. To avoid this assumption, we use the method of Kovac and Silverman [18], which is briefly explained in the “Pre-processing of data” section.
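As a code sketch, the pyramid recursions above can be implemented directly with the unnormalized (difference, sum) convention used in this section; the function name and the per-scale dictionary layout are illustrative choices:

```python
import numpy as np

def haar_pyramid(g):
    """Unnormalized Haar d and c coefficients via Mallat's pyramid algorithm,
    following the (difference, sum) convention of this section; len(g) = 2**J."""
    cur = np.asarray(g, dtype=float)
    J = int(np.log2(len(cur)))
    assert 2 ** J == len(cur), "input length must be a power of two"
    d, c = {}, {}
    for scale in range(J - 1, 0, -1):
        d[scale] = cur[1::2] - cur[0::2]  # local differences
        c[scale] = cur[1::2] + cur[0::2]  # local sums
        cur = c[scale]
    # At the coarsest scale, d_{0,1} = c_{0,1} = sum of all measurements.
    d[0] = c[0] = np.array([cur.sum()])
    return d, c

g = [0, 1, 2, 2, 0, 0, 1, 2]  # one individual's genotypes, n = 2**3
d, c = haar_pyramid(g)
# c[0][0] equals the total minor-allele count, sum(g) = 8
```

Note that the coefficients are left unnormalized here, matching the recursions; the matrix forms (W_d) and (W_c) additionally normalize each row.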

Wavelet representation of the genome

In essence, the coefficients obtained after performing wavelet transform on a genomic region can be viewed as local ‘scores’ of the genotype, with the following interpretations:

  • At scale 0, the wavelet coefficients d and c can be interpreted in the same way: they summarize the discrepancy between an individual’s genotypes and the reference genotypes coded as 0…0. This is essentially the comparison performed in standard gene- or region-based tests.

  • The wavelet (d_{s,l}) coefficient at scale (s > 0) and location l for an individual represents the difference in the number of minor alleles between the left part of the region (defined by sl) and the right part.

  • The wavelet (c_{s,l}) coefficient at scale (s > 0) and location l for an individual represents the discrepancy between an individual’s genotypes and the reference genotypes coded as (0ldots 0) for the region defined by sl.

The main rationale behind this modeling is that, if a genetic locus has an effect on the phenotype, then the association is likely to be spread across genomic regions of a given size (scale) at different positions (locations). By using wavelet transform to perform a position/size (or, alternatively, time/frequency) decomposition and then regressing the wavelet coefficients on the phenotype, one can visualize where (location) and how (scale) the genetic signal influences the phenotype.

In the rest of this article, we use ‘wavelet coefficients’ to refer to c coefficients specifically, as c coefficients are easier to interpret than d coefficients. For instance, in the case of completely observed genotypes, the c coefficients correspond to sums of minor alleles, similar to the Burden test [19].

Pre-processing of data

Non-decimated wavelet transform

We use the method of Kovac and Silverman [18] to handle unevenly spaced data. This method takes an irregular grid of data, for example, the sampling positions of different genetic regions, and interpolates the data onto a pre-specified regular grid of length (2^J). For a given genetic region (GR_{lb,ub}) with measurements at n positions (bp_1, ldots, bp_n), we map this region onto the interval ((0, 1)) using the affine transformation (x mapsto frac{x-bp_1}{bp_n-bp_1}). We then define a new grid of (N = 2^J) points on ((0, 1)), (t_{0},ldots ,t_{N-1}), where (t_{k}= (k+frac{1}{2})2^{-J}) and (J = min {j in {mathbb {N}}, 2^j ge n }). We interpolate the mapped signal onto this grid and apply the wavelet transform to obtain the wavelet coefficients. In practice (see the “Applications” section), we recommend selecting genetic regions with a relatively high density of imputed SNPs.
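A simplified sketch of this interpolation step, using plain linear interpolation as a stand-in for the specific scheme of Kovac and Silverman; the positions and genotype values are made up:

```python
import numpy as np

def to_dyadic_grid(bp, g):
    """Map irregular positions onto (0, 1) and interpolate the signal onto the
    grid t_k = (k + 1/2) * 2**-J with J = min{j : 2**j >= n}.
    Plain linear interpolation stands in for the Kovac-Silverman scheme."""
    bp = np.asarray(bp, dtype=float)
    g = np.asarray(g, dtype=float)
    n = len(bp)
    J = int(np.ceil(np.log2(n)))
    N = 2 ** J
    x = (bp - bp[0]) / (bp[-1] - bp[0])  # affine map onto [0, 1]
    t = (np.arange(N) + 0.5) / N         # regular dyadic grid
    return np.interp(t, x, g)

bp = [100, 250, 400, 900, 1500, 1700]    # n = 6 irregular SNP positions
g = [0, 1, 2, 1, 0, 2]
g_reg = to_dyadic_grid(bp, g)            # length 2**3 = 8
```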

Coefficient-dependent thresholding and quantile transform

For each individual wavelet decomposition, we use the VisuShrink approach [18] to shrink the interpolated wavelet coefficients and reduce the dependence between wavelet coefficients within scales. The variance of each wavelet coefficient is first estimated, which allows us to account for the individual heteroscedasticity of the noise by setting a specific, coefficient-dependent threshold. Next, we quantile-transform the distribution of each wavelet coefficient within the population so that each distribution follows a N(0, 1) distribution. As we use the quantile-transformed wavelet coefficients as the endogenous variable (see the “Modeling” section), this transformation ensures that, under the null hypothesis, the residuals are normally distributed. It also guards against spurious associations resulting from deviations from the normality assumption of the linear model.
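The population-level quantile transform can be sketched as a rank-based inverse-normal transform; the (r + 1/2)/n offset convention and the absence of tie handling are assumptions of this sketch:

```python
import numpy as np
from statistics import NormalDist

def quantile_transform(x):
    """Rank-based inverse-normal transform: the transformed values follow an
    approximately N(0, 1) distribution regardless of the input distribution.
    Ties are not handled here (an assumption for this sketch)."""
    x = np.asarray(x, dtype=float)
    ranks = np.argsort(np.argsort(x))  # ranks 0..n-1
    nd = NormalDist()
    return np.array([nd.inv_cdf((r + 0.5) / len(x)) for r in ranks])

# One wavelet coefficient observed across a population of 1000 individuals,
# deliberately skewed to show the effect of the transform.
coef = np.random.default_rng(1).gamma(2.0, size=1000)
z = quantile_transform(coef)  # approximately standard normal
```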

Modeling
In essence, our approach to modeling aims at detecting regions containing sub-regions associated with a trait/phenotype of interest, and at localizing these sub-regions to ease the interpretation of the output. To estimate the effect of a genetic region on the phenotype of interest, we first need to assess whether certain scales are associated with the phenotype at different locations. Within a genetic region, let ({tilde{G}}_{sl}) denote the quantile-transformed wavelet coefficient at scale s and location l. To test for association between the phenotype and the wavelet coefficients, we regress the wavelet coefficients on the phenotype (Phi) using traditional least squares estimation for Gaussian linear models [20]. To adjust for covariates C that may act as confounders in the GWAS, we incorporate them into the regression. The regression model for each scale and location is defined as follows:

$$\begin{aligned}&M_1: {\tilde{G}}_{sl} = \beta_{sl,0} + \beta_{sl,1}\Phi + \beta_{sl,C}C + \epsilon \end{aligned}$$


where C is a matrix of dimension (c times 1), (beta _{sl,C}) is a matrix of dimension (1 times c), and (epsilon sim Nleft( 0,1right)). We compute the association parameter (beta _{sl,1}) of the wavelet regression for each pair (sl) using least squares estimation [20].
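A minimal sketch of fitting model M1 for one (s, l) pair by ordinary least squares; the sample size, covariates, and effect sizes below are simulated placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

phi = rng.normal(size=n)            # phenotype Phi
C = rng.normal(size=(n, 2))         # c = 2 covariates
# Simulated quantile-transformed coefficient with a true slope of 0.3 on Phi.
g_sl = 0.3 * phi + C @ np.array([0.1, -0.2]) + rng.normal(size=n)

# Design matrix with the intercept first and the phenotype in the second column.
X = np.column_stack([np.ones(n), phi, C])
beta_hat, *_ = np.linalg.lstsq(X, g_sl, rcond=None)
beta_sl_1 = beta_hat[1]             # estimated association parameter
```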

Evidence towards the alternative

For a given locus, a genetic signal might be assumed to occur in only a subset of the regression coefficients. Thus, the ({hat{beta }}_{sl,1}) may be viewed as originating from a mixture of two Gaussian distributions, each representing a specific hypothesis:

  • Under (H_0) the ({hat{beta }}_{sl,1}) are distributed as (N(0,sigma ^2_{null})).

  • Under (H_1) some ({hat{beta }}_{sl,1}) are distributed as (N(mu _{alt},sigma ^2_{alt})).

To help identify the subset of (beta) coefficients that convey the signal, we fit a mixture of the two Gaussian distributions to the collection of estimated ({hat{beta }}_{sl,1}), with mixture weights (1 - pi _{alt}) and (pi _{alt}), respectively. Under the null hypothesis, the full mixture is not identifiable. To estimate (sigma ^2_{null}), (mu _{alt}), and (sigma ^2_{alt}) in all cases, and to ensure that the estimated ({hat{pi }}_{alt}) becomes small under (H_0), we constrain the Gaussian mixture fitting using a modified version of the EM algorithm, with the restrictions ({hat{sigma }}^2_{null} > frac{left( X^tX right) ^{-1}_{2,2}}{10}) and ({hat{sigma }}^2_{alt}>kleft( X^tX right) ^{-1}_{2,2}), where X is the design matrix with the phenotype (Phi) in its second column and k is of the order of 100.

After obtaining the estimates, we compute, for each beta coefficient, the posterior probability ({hat{pi }}_{l,s}) of (H_1) given ({hat{beta }}_{sl,1}):

$$\begin{aligned} {\hat{\pi}}_{l,s} = {\mathbb{P}}\left( H_1 \mid {\hat{\beta}}_{sl,1} \right)&= \frac{{\hat{\pi}}_{alt}\,\phi\left( {\hat{\beta}}_{sl,1};{\hat{\mu}}_{alt},{\hat{\sigma}}^2_{alt} \right)}{{\hat{\pi}}_{alt}\,\phi\left( {\hat{\beta}}_{sl,1};{\hat{\mu}}_{alt},{\hat{\sigma}}^2_{alt} \right) + \left( 1-{\hat{\pi}}_{alt}\right)\phi\left( {\hat{\beta}}_{sl,1};0,{\hat{\sigma}}^2_{null} \right)}, \end{aligned}$$


where (phi left( cdot ;mu ,sigma ^2right)) denotes the density of a Gaussian distribution with mean (mu) and variance (sigma ^2). To reduce the influence of betas that most likely belong to (H_0), we propose thresholding the posterior probabilities ({hat{pi }}_{l,s}) with a threshold that decreases with sample size as well as with wavelet depth. Following the work of Donoho and Johnstone [21], we define the thresholded probabilities by

$$\begin{aligned} {\tilde{\pi}}_{l,s}= \max\left( {\hat{\pi}}_{l,s} - \frac{1}{\sqrt{2\log(n)}\sqrt{2^s}},\,0\right). \end{aligned}$$


We later use the ({tilde{pi }}_{l,s}) to localize the sub-regions of interest (for details, see the “Model and results” section).
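Given fitted mixture parameters (hypothetical values below; the constrained EM fit itself is omitted), the posterior probability and its thresholded version can be computed as:

```python
import math

def gauss_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior_h1(beta_hat, pi_alt, mu_alt, var_alt, var_null):
    """Posterior probability of H1 given one estimated beta."""
    num = pi_alt * gauss_pdf(beta_hat, mu_alt, var_alt)
    den = num + (1 - pi_alt) * gauss_pdf(beta_hat, 0.0, var_null)
    return num / den

def threshold(pi_hat, n, s):
    """Thresholding that decreases with sample size n and wavelet depth s."""
    return max(pi_hat - 1.0 / (math.sqrt(2 * math.log(n)) * math.sqrt(2 ** s)), 0.0)

# Hypothetical fitted values, for illustration only.
pi_hat = posterior_h1(0.5, pi_alt=0.1, mu_alt=0.4, var_alt=0.05, var_null=0.01)
pi_tilde = threshold(pi_hat, n=10_000, s=3)
```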

Finally, we compute the average evidence towards the alternative by

$$\begin{aligned} L_h = \sum_{s=0}^{S} \frac{1}{2^s} \sum_{l=1}^{2^s} \left[ {\tilde{\pi}}_{l,s}\,\phi\left( {\hat{\beta}}_{sl,1};{\hat{\mu}}_{alt},{\hat{\sigma}}^2_{alt} \right) - \left( 1-{\tilde{\pi}}_{l,s}\right)\phi\left( {\hat{\beta}}_{sl,1};0,{\hat{\sigma}}^2_{null} \right) \right]. \end{aligned}$$


Note that, in contrast to the work of Shim and Stephens [10], where lower scales have more weight, our test statistic applies equal weight to each scale. As shown in the upper panel of Fig. 2, the separation between the null and alternative distributions achieved by this test statistic ((L_h)) alone is not optimal because the resulting test power is low. For additional details, see the “Complex genetic signal” section and the “Appendix”.
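The evidence summary (L_h) can be sketched as follows, with hypothetical mixture parameters and per-scale dictionaries of estimates; note the equal weight (1/2^s) applied to each scale:

```python
import math

def gauss_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def evidence_l_h(beta, pi_tilde, mu_alt, var_alt, var_null):
    """Average evidence towards the alternative, weighting each scale equally."""
    total = 0.0
    for s in beta:
        for b, p in zip(beta[s], pi_tilde[s]):
            total += (p * gauss_pdf(b, mu_alt, var_alt)
                      - (1 - p) * gauss_pdf(b, 0.0, var_null)) / 2 ** s
    return total

# Hypothetical estimates for a depth-1 decomposition (scales 0 and 1).
beta = {0: [0.45], 1: [0.05, 0.40]}
pi_tilde = {0: [0.30], 1: [0.00, 0.25]}
l_h = evidence_l_h(beta, pi_tilde, mu_alt=0.4, var_alt=0.05, var_null=0.01)
```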

Fig. 2

Average evidence towards the alternative hypothesis. The simulated null distribution is shown in blue (’null’), the empirical null distribution in green (’emp’), and the alternative distribution in pink (’alt’)

Inflation of the alternative hypothesis

To improve the separation of the null and alternative distributions, we extract two additional pieces of information from the ({tilde{pi }}_{l,s}). First, we compute the average proportion of associations per scale, a weighted sum that applies the same weight to each scale. This can be interpreted as a horizontal summary of the association:

$$\begin{aligned} p_h = \sum_{s=0}^{S} \frac{1}{2^s} \sum_{l=1}^{2^s} {\tilde{\pi}}_{l,s}. \end{aligned}$$


Second, we extract a vertical summary by considering sub-regions of the overall screened region. We divide the region into (S-1) sub-regions, where S is the maximum depth of the analysis. We then summarize the proportion of associations vertically; for each sub-region average, we include the coefficients whose positions overlap with that sub-region. For example, the first coefficient at scale 1 contributes to half of the sub-region averages.

$$\begin{aligned} p_v&= \sum_{k=1}^{S-1} \frac{1}{n_k} \sum_{s=1}^{S} \sum_{l= \left\lfloor \frac{2^s (k-1)}{S-1} \right\rfloor}^{\left\lfloor \frac{2^s k}{S-1}\right\rfloor} {\tilde{\pi}}_{l,s} \end{aligned}$$


$$\begin{aligned} n_k&= \mathrm{Card}\left( \left\{ (s,l) : s \in \left[ 1,S\right],\; l \in \left\llbracket \left\lfloor \frac{2^s (k-1)}{S-1} \right\rfloor, \left\lfloor \frac{2^s k}{S-1}\right\rfloor \right\rrbracket \right\} \right) \end{aligned}$$


We use these new summaries of association to increase the separation between the null and the alternative, under the assumption that (p_v) and (p_h) tend to be larger under the alternative hypothesis. We then build our full test statistic (T_{S_lambda }), which requires calibration of the hyperparameter (lambda):

$$\begin{aligned} T_{S_\lambda}&= L_h + \lambda \cdot \min(p_h,p_v) \end{aligned}$$


A larger (lambda) yields higher power under the assumption that (p_v) and (p_h) tend to be larger under the alternative hypothesis. However, increasing (lambda) can also change the shape of the null distribution. We therefore use normality of the null distribution as the criterion for selecting the hyperparameter.
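The horizontal summary and the combined statistic can be sketched as below; (L_h), (p_v), and (lambda) are placeholder values, and the per-scale arrays hold the thresholded posteriors:

```python
import numpy as np

def p_horizontal(pi_tilde):
    """Average proportion of associations, weighting each scale equally."""
    return sum(arr.sum() / 2 ** s for s, arr in pi_tilde.items())

pi_tilde = {0: np.array([0.2]),
            1: np.array([0.1, 0.3]),
            2: np.array([0.0, 0.4, 0.0, 0.1])}
p_h = p_horizontal(pi_tilde)      # 0.2/1 + 0.4/2 + 0.5/4 = 0.525

L_h, p_v, lam = 1.3, 0.6, 2.0     # placeholders for the other ingredients
T = L_h + lam * min(p_h, p_v)     # full test statistic T_{S_lambda}
```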

Calibration of the hyperparameter and statistical significance

Our goal is to find the right balance between choosing (lambda) as large as possible and keeping the null distribution normal. As (min(p_h,p_v)) is not normally distributed (it has a bounded distribution), the larger (lambda) is, the further (T_{S_{lambda }}) deviates from normality. To strike this balance, we simulate (L_h) and (min(p_h,p_v)) under the null ((10^{5}) simulations in our case). Next, we fit a normal distribution to (L_h) and use this fit to generate a histogram of the simulated p-values with 1000 bins. We compute the number of counts in each bin and rank the bins by count (Figs. 2 and 3). We are particularly interested in the rank of the first bin, as an inflation of this bin would influence the false discovery rate. This procedure is repeated for increasing values of (lambda), and the search is stopped when a rank increase in the first bin is observed. We select the largest (lambda), denoted (lambda ^*), for which the rank of the first bin equals the rank of the first bin for (L_h). Finally, we use the normal fit of (T_{S_{lambda ^*}}) to perform the testing. In the “Appendix”, we provide a pseudo-code description of the procedure (see Algorithm 1).
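A sketch of this calibration loop follows; the null draws of (L_h) and (min(p_h,p_v)) are simulated placeholders (a standard normal and a beta distribution), and the grid of (lambda) values is an arbitrary choice:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(3)
n_sim = 20_000
l_h_null = rng.normal(0.0, 1.0, n_sim)   # placeholder null draws of L_h
min_p_null = rng.beta(2.0, 8.0, n_sim)   # placeholder draws of min(p_h, p_v)

def first_bin_rank(t, bins=1000):
    """Fit a normal to t, histogram the implied p-values, and return the rank
    (0 = most counts) of the first bin among all bins ordered by count."""
    nd = NormalDist(t.mean(), t.std())
    pvals = np.array([1.0 - nd.cdf(v) for v in t])
    counts, _ = np.histogram(pvals, bins=bins, range=(0.0, 1.0))
    order = np.argsort(-counts, kind="stable")
    return int(np.where(order == 0)[0][0])

base_rank = first_bin_rank(l_h_null)     # reference rank for L_h alone
lam_star = 0.0
for lam in np.arange(0.25, 2.01, 0.25):  # increasing candidate lambdas
    if first_bin_rank(l_h_null + lam * min_p_null) == base_rank:
        lam_star = lam                   # largest lambda keeping the rank
    else:
        break                            # stop at the first rank change
```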

Fig. 3

Rank shifting of the bin of interest

As the wavelet transform induces correlation between the ({hat{beta }}_{sl,1}), they cannot be simulated from a univariate normal distribution using their theoretical null distribution. One option is to compute the empirical covariance matrix of the ({hat{beta }}_{sl,1}) and simulate the (beta _{sl,1}) from this empirical null distribution. A second option is to simulate wavelet coefficients from random signals drawn from a standard normal distribution and then re-scale them to obtain a mean of zero and a variance of one. A third possibility is to compute the covariance matrix of these wavelet coefficients and re-scale them using the theoretical null distribution of the ({hat{beta }}_{sl,1}). These options yield similar results (see Fig. 1 in the “Appendix” for a comparison). In the “Appendix”, we also provide a pseudo-code description of the procedure (see Algorithm 2 and Algorithm (2^{bis})).
