Batch-effect detection, correction and characterisation in Illumina HumanMethylation450 and MethylationEPIC BeadChip array data | Clinical Epigenetics

Experimental design and processing steps

For the EpiSCOPE study [20], DHA supplementation and gender were balanced as much as possible across the 12 450K BeadChips on each glass slide, with these factors also randomly distributed over the 6 rows and 2 columns of 31 slides (Additional file 1: Fig. S1). Blood DNA was extracted from Guthrie card blood spots and the 369 arrays were processed according to our blocking plan in six processing runs (superbatches) at the Australian Genome Research Facility (AGRF), Melbourne, Australia.

Similarly, the BFiN study [22] employed a blocked design over the 8 EPIC arrays on a glass slide. Apart from gender, the experimental factors of interest were continuous, so the samples were blocked over each slide to ensure (as much as possible) equal representation of gender and similar body fatness distribution across the 22 slides (Additional file 2: Fig. S2). As with the EpiSCOPE study, the position was randomised to avoid correlation with slide position. DNA was extracted from neonate saliva and the 169 arrays were processed in accord with our blocking plan across two processing runs (superbatches) at the Australian Cancer Research Foundation (ACRF) Cancer Genomics Facility, Adelaide, Australia. For both studies, further detail on the design and processing is given in the methods sections here and in the relevant study publication.

Technical variation evident in control probes

The Infinium technology incorporates a number of control probe families to measure the efficiency and quality of the many steps in the protocol (the control classes are described by Illumina in their BeadArray Controls Reporter Software Guide document). Technical effects were evident across particular slides of both the EpiSCOPE and BFiN datasets. There was little commonality in the nature of these technical effects across the two datasets.

Across the EpiSCOPE 450K data, we observed slides 9 and 17 had outlier probes for the staining controls with reduced fluorescence in the expected high intensity biotin beads (Fig. 2a). Slide 1 exhibited rather high intensity staining for the target removal probes (Fig. 2b). In the extension controls, some arrays on slides 17, 21 and 25 exhibited increased variances (Fig. 2c). Arrays from slides 17 and 25 were also outliers more broadly and exhibited less fluorescence for the hybridisation, extension, stripping, specificity and non-polymorphic controls, as well as the two bisulphite controls sets (Additional file 3: Fig. S3). Collectively, the EpiSCOPE study control probe analyses show some arrays on slides 17 and 25 have increased variability across a wide range of control metrics. This suggests slides 17 and 25 will also have increased technical variance for the experimental probes, which will be observed as a batch-effect.

Fig. 2
figure 2

Selected Infinium control probes. Across the EpiSCOPE study data, outlier slides were observed for the staining (a), target removal (b), and c extension control probes. For the BFiN study data, the Type I (d) and Type II (e) bisulphite conversion control probes showed evidence of reduced bisulphite conversion efficiency for slides 13–22

Within the BFiN EPIC data, a pronounced difference between each superbatch was observed. Slides 13–22 (the entirety of superbatch 2) had large increases in intensity for the Type I unmethylated control probes (Fig. 2d) and Type II green channel probes (Fig. 2e). These control probe results suggest a reduced efficiency in the bisulphite conversion rate in superbatch 2. Some negative control probes on the red and green channels also had high fluorescence intensity for superbatch 2 (Additional file 4: Fig. S4). Otherwise, the control probes exhibited uniformity across slides. Collectively, this suggests the EPIC BeadChips were of high quality and that increased technical variance leading to batch-effect will more likely be across slides 13–22. With decreased efficiency of bisulphite conversion, probes with many CpG sites may mismatch across their length and there will be a bias in calling very low rates of methylation, as this requires high conversion efficiency.

Colour balance

Across the green (Cy3) and red (Cy5) signal intensities, we observed notable dye bias on both the 450K and EPIC array designs. For Cy3, there was consistently lower mean fluorescence intensity and less variability than Cy5 (Fig. 3).

Fig. 3
figure 3

Correlation of processing run with signal intensity and positively detected probes. Green and red channel intensities and positively detected probes grouped by slide and processing order are presented for the EpiSCOPE and BFiN studies. The EpiSCOPE study was composed of 31 slides processed in six processing runs (superbatches A–F). The BFiN study was composed of 22 slides processed in two processing runs (superbatches A–B). Slides highlighted in gold illustrate the variation observed across superbatches. The first slide of the day for the EpiSCOPE study (slides 1, 5, 9, 13, 17, 25) and superbatch B for the BFiN study (slides 13–22)

When the same Cy3 and Cy5 mean fluorescence intensity data are considered by slide, a correlation with superbatch was also observed. We note the first EpiSCOPE slide in each of the six processing runs (superbatches A–F) typically had more or less Cy3 and Cy5 signal and a reduced fraction of positively detected probes (Fig. 3). Across the two BFiN processing runs (superbatches A–B), the second superbatch had less variability in Cy3 and Cy5 signal than the first. Slides 3 and 5 had a slightly reduced fraction of positively detected probes compared to the other slides.

A bias associated with BeadChip array position on the glass slide was also observed (Fig. 4 and Additional file 5: Fig. S5). There was a distinct trend of lower fluorescence intensity in the first 2 or 3 rows compared to the latter rows. The EPIC array had reduced Cy3 and Cy5 fluorescence (Additional file 5: Fig. S5), while for the 450K array (Fig. 4), the effect was largely confined to the Cy3 channel. We also noted that for the 450K design, in which the slides are laid out as a matrix of six rows by two columns, the Cy3 channel column 1 intensities were slightly higher than the corresponding column 2 intensities. As the Sentrix scanner is scanning arrays from row 1 onwards, the lowered fluorescence is not consistent with photodegradation. Instead this could possibly be due to the flow-through chambers in the Infinium XStain protocol, where XStain reagents flow from the last row to the first row (Melinda Ziino, personal communication).

Fig. 4
figure 4

EpiSCOPE fluorescence intensity slide positional effect is reduced with preprocessing methods. Infinium green (Cy3 dye) and red (Cy5 dye) fluorescent intensities are formulated into methylated (meth) and unmethylated (unmeth) signals. These meth and unmeth signals are used to calculate β and M values. If the 369 BeadChips in the EpiSCOPE set are grouped by row (R) and column (C) position on the glass slide, there is evidence that the distribution of fluorescent intensities is associated with this position. The position effect diminishes with preprocessing methods. For some between-array methods no variation in mean is observed, as all the BeadChips have had mean fluorescent intensities moderated to be the same

The position bias should particularly affect Infinium II probes, as unbalanced falls in cyanine dye intensity will directly translate to biases in the methylated or unmethylated signal. The evident dye bias by array position implies that correction for dye bias is important in removing a source of batch-effects and that across slides, experimental factors of interest should not be structured by rows (and columns for the 450K array design).

Preprocessing methods and the removal of position and dye bias

A total of 12 preprocessing methods were used to normalise the Infinium BeadChip data. Some of these methods incorporate dye bias correction within the process and some can accept data previously corrected for dye bias. The normalisation methods fall into three broad groups: raw preprocessing, within-array and between-array methods. Raw preprocessing directly translates Cy3 and Cy5 intensities into methylation and unmethylation signal, so there is no correction for probe design or dye bias. Within-array methods consider factors such as probe type and background correction and adjust each array individually. The six methods used were Subset-quantile Within Array Normalisation (SWAN) [23], Exponential–Normal mixture signal intensity background correction (ENmix) [24], Beta MIxture Quantile dilation (BMIQ) [25], Normal–exponential out-of-band convolution (noob) background and dye correction [26], SWAN performed on noob-corrected data (noob + SWAN) and similarly, BMIQ on noob-corrected data (noob + BMIQ). The between-array methods consider similar factors to within-array methods but seek to make the set of arrays identical in statistical properties by conditioning the data using techniques such as quantile normalisation. Applying quantile normalisation results in methylation arrays all scaled to have the same mean methylation and unmethylation signal. This large moderation in signal is best suited to circumstances where the biological variation in DNA methylation across arrays is modest, with most probes invariant. When the global invariance assumption is not met, larger global changes in distribution will be lost and false positives can be introduced from erroneous adjustment of features. The five between-array methods used were standard Illumina preprocessing [27], Dasen [28], Dasen performed on noob-corrected data (noob + Dasen), ENmix with subsequent quantile normalisation (ENmix—Quantile) and Functional normalisation [29].

For the ten methods that process and return methylation and unmethylation channel data, we examined the removal of position or dye bias (Fig. 4, Additional file 5: Fig. S5). We found obvious bias with raw preprocessing, while all the other normalisation approaches greatly improved the data quality. The within-array normalisation SWAN method randomly selects discrete subsets of Type I and Type II probes with similar CpG content, performs subset quantile normalisation on these subsets and interpolates the remaining probes to define new intensities. The SWAN method increased the methylation mean signal and dispersion to be nearer that of the unmethylated channel, which removes a large amount of position bias. The noob method uses normal–exponential convolution for background correction on DNA methylation arrays using the out-of-band fluorescence intensities of Type I probes on the array, as well as the control probes. This greatly increases the amount of data to model background fluorescence. The noob method was observed to compress the intensity data with the unmethylated signal having a mean and dispersion similar to the original Cy3 signal. The noob methylated and unmethylated signals appeared more uniform in comparison with SWAN-normalised data. This equalisation removes bias as the formation of M values or β is less impacted by dye or position. The combination of SWAN and noob was subtle with marginally more methylated signal dispersion. The ENmix preprocessing method models the methylation signal intensity with a flexible exponential–normal mixture distribution, together with a truncated normal distribution to model background noise and is an extension of robust multi-array-average (RMA) and noob. ENmix increased the mean and dispersion of the methylated signal to be more like the Cy5 signal, resulting in removal of much of the position bias.

For the between-array normalisation methods, we found Illumina preprocessing method scales all the fluorescence intensities from arrays towards an array from row 1. This scaling removes obvious position and dye bias from the data with a small amount of dispersion in array means remaining. Dasen combines background adjustment with a between-array normalisation method that is applied separately to methylated and unmethylated channels further subdivided into Type I and Type II probes. Dasen does not perform dye bias correction, so it is reasonable to also normalise data previously corrected by noob. Dasen moderated the means of all the arrays to be the same so no dispersion of BeadChip means remained. A large difference in fluorescence intensity still existed between the methylated and unmethylated channel but this disappeared with the use of Dasen in combination with noob. The ENmix authors developed a quantile normalisation method which can be run on ENmix corrected data. This hybrid ENmix-quantile normalisation approach, like the Dasen and noob combination, also equalised the methylated and unmethylated fluorescence intensities. Functional normalisation is typically coupled with prior noob background correction. The method extends quantile normalisation and only removes variation explained by a set of covariates associated with technical variation, such as the first two principal components of Infinium control probes. This PCA-based approach has similarities to Harman batch-effect removal but is more limited in scope. Functional normalisation returned results similar to Illumina preprocessing; however, there was less position bias.

In summary, the use of preprocessing to remove probe type, background and dye bias was highly effective in removing much of the obvious glass slide position bias, with methylated and unmethylated signal largely equalised. The means of the methylated and unmethylated signals could be further moderated together using between-array normalisation methods.

Global methylation overview and batch-effect model specification

Unsupervised learning via principal component analysis (PCA) was used to examine global methylation patterns and determine the impact of different normalisation methods on the data, to gauge the influence of biological factors and to identify batch-effects (Fig. 5). Coordinated increased DNA methylation due to X chromosome inactivation in females (reviewed in [12]) is often a large source of correlation across the samples, so to allow easier inspection of technical effects in the data, we removed all X and Y chromosome probes from the PCA analysis. Both the PCA analysis and batch correction used M values. Post-correction, the M values were converted to β via a logit transform.

Fig. 5
figure 5

Principal component analysis of the EpiSCOPE and BFiN data. PCA was conducted on the original raw preprocessed M values and again after correction via Harman or ComBat. The data are presented with the number and colour signifying BeadChip slide identifier and the bold and pastel shading signifying male and female gender, respectively. PCA was also conducted on noob preprocessed data and coloured by slides of note across processing runs (those slides highlighted in gold in Fig. 3), or estimated cellular fraction. For the EpiSCOPE data (a), dimensions 1 and 2 of the PCA plots show the data to separate by slide. This was particularly evident in slides 1 and 25 and less so for slides 9 and 17. Arrays from slide 5 separated out discretely on dimension 4. The PCA plots of Harman or ComBat corrected data show the absence of data separation by slide; instead the corrected data show a strong separation by gender in principal dimensions 3 and 4, despite the data being limited to autosomal probes only. Separation of the data by DHA supplementation (experimental treatment) was not apparent in the principal components examined. b Consistent with the control probe findings, the 450K slides with high technical variation (slides 1, 5, 9, 17, 25) are the first arrays processed in each processing run. c Some separation of the data on the fourth dimension by the estimated proportion of neutrophils in the blood sample was observed. In the BFiN data PCA analysis (d), there was not obvious separation of the raw preprocessed data by slide identifier on dimensions 1 and 2. However, slide 3 clearly separated out on dimension 4. Batch correction via Harman or ComBat was sufficient to remove the separation of slide 3. The PCA plots of noob preprocessed data illustrate the two largest factors influencing the autosomal probes; e the eigenvalues for dimension 2 showed two clouds of samples—one for slides 1–12 and the other, slides 13–22 and f cellular composition—with saliva samples containing a higher immune cell component separating out on dimension 1. Within each of these two clouds there was further structure, with samples from some slides clustering together. For the BFiN data, the technical (batch) variation is largely due to processing run (superbatch) and less so, the individual slides

To identify and remove batch-effects using Harman or ComBat, a batch variable needs to be declared to the algorithm, as well as sources of biological variance that the algorithm should seek to retain. The EpiSCOPE data PCA analysis illustrated that arrays are often separating in groups composed of the sets of 12 arrays on the same glass slide, so we may consider this to be a ‘batch’. The superbatch structure would be each of the six processing runs. With the balanced design of the experiment, there was a near equal representation of gender and DHA supplementation across each slide. For Harman, the experimental variable to keep was declared as a compound variable of gender with DHA supplementation, while with ComBat, gender and DHA supplementation were the two covariates specified in the linear model. For both batch-correction methods the largest source of technical variation to remove was slide number (Fig. 5a, b). A minor impact of blood cellularity could be found in PCA dimension 4 of the noob-normalised EpiSCOPE data (Fig. 5c).

The BFiN data PCA analysis showed the greatest structure in the data due to technical effects was the processing run (Fig. 5d, e) and that cellular composition of the samples was a large source of biological variance and separated the data across PCA dimension 1 (Fig. 5f). Harman supports only categorical factors, so this cell component estimate was used to cut the samples into two groups; low and high immune component with a cut point at 6% estimated immune cell fraction. The BFiN data also had a balanced design across the slides by gender. For Harman, gender and the two immune component groups were combined as a compound variable specifying the biological variance to keep. With ComBat, gender and the immune component specified as a continuous variable were given to the linear model. Like with the EpiSCOPE data, for both batch-correction methods the source of technical variation specified to remove was slide number. Removing other sources of technical variance were also tested including using superbatch as the batch variable, or a combination of first removing variance associated with superbatch and then slide number. Specifying just the slide number was found to be sufficient to also remove the superbatch structure (analysis not shown).

Removal of batch-effects

As observed from PCA plots (Fig. 5a, d), Harman and ComBat were highly successful in removing the batch-effect across samples. Both methods removed the separation of the data by slide (technical variance) and importantly did not remove the separation of declared biological factors of interest (such as gender and cell composition). Next, for each normalisation method, the amount of batch correction made to individual β values was examined as a measure of method performance. A β difference matrix was computed by taking the difference between the batch-corrected and original β matrixes, and then, for each individual CpG probe the absolute value of the interval between the minimum and maximum β difference was calculated. Large values in this maximal probe-wise β difference statistic highlight probes where one or more batches required a large amount of correction relative to others. The distribution of the maximal probe-wise β difference statistic across CpG sites can be examined to understand the effect of the various normalisation methods. Further adjustment made by batch-effect correction software is on the residual technical variance left after preprocessing. We propose that superior normalisation approaches should have more probes with smaller maximal probe-wise β difference statistic, as this suggests reduced technical variance across all batches.

On both the 450K and EPIC array formats, Type I probes were found to require far less batch correction than Type II probes (Fig. 6). This fits with expectation, as Type II probes are far more prone to colour bias and by extension, position bias. The maximal probe-wise β difference statistic was used to group the probes into those requiring low (< 0.01 β), moderate (between 0.01 and 0.10 β) and high (> 0.10 β) levels of adjustment for batch-effect on one or more slides. ComBat more aggressively adjusted the data with more probes in the high adjustment group across the two array designs and for all the preprocessing methods, except for the Illumina preprocessing method on the EPIC array, which was approximately equivalent (Table 1, Fig. 6).

Fig. 6
figure 6

A global analysis of batch-effect corrections made after various preprocessing. Across each probe in the EpiSCOPE (a) and BFiN study (b), the maximal probe-wise beta difference after batch-effect correction was determined and a density plot constructed. The area under the curve illustrates the maximal level of adjustment across the probe distribution. The two vertical lines highlight probes with 10% and 1% maximal probe-wise beta difference (− 1 and − 2, respectively, in log scale). This is consistent with the segmentation used in Table 1. The rightmost column illustrates that if raw preprocessed data are subset into Type I (grey dashes) and Type II (grey dots) probes, many of the Type I probes had less batch-effect correction adjustment

Table 1 Beta differences after batch-effect removal across and datasets and preprocessing methods

It is expected that normalisation methods accounting for bias due to factors such as colour, position and probe type should need less correction for variation across slides. The within-array preprocessing methods of ENmix, noob and the combination of noob and SWAN or BMIQ were effective in shifting probes requiring a moderate adjustment with raw preprocessing to the low adjustment group (a shift from maximal probe-wise beta difference between 0.01 and 0.10 β to under 0.01 β). For the between-array methods, the noob and Dasen combination or the ENmix-Quantile methods were effective at increasing the proportion of probes requiring low adjustment. The use of Dasen (without the addition of noob) was particularly effective in reducing the number of probes in the high adjustment group (Table 1, Fig. 6).

For the remainder of our analyses, noob-normalised data were used. The noob method has good performance in correcting for batch-effect and underlies, or can be combined with, several other normalisation methods. It also makes few assumptions about the data and is a within-array method, so is suitable for experiments with high biological variation, where quantile normalisation might not be appropriate.

Identifying clustered methylation

The distribution of β at particular CpG sites may be modal in nature, which results in a clustered methylation pattern across samples. In practice, clustered methylation profiles can be due to technical or biological factors, and batch correction should ideally remove the former but preserve the latter. As biological variance to keep is declared per sample, rather than per feature, batch-effect removal software is prone to erroneously attributing methylation clustering to technical factors. When the distribution of biologically relevant clustered methylation is unbalanced across batches, then batch-effect removal software will inappropriately seek to converge the means of each batch. It is important to identify the CpG sites subject to such erroneous correction as this adjustment can profoundly destroy biologically meaningful clustering of the data. Which CpG sites naturally exhibit clustered methylation in populations is not well characterised, so a data-driven empirical approach was employed to find them.

To investigate, an optimal univariate k-means clustering method was used to identify multi-modal features in noob-normalised data, with each cluster having at least 5 samples and a β difference between the cluster centroids of 0.1. When first defining the number of clusters for each probe, samples arising from slides with the greatest batch-effects (EpiSCOPE slides 1, 5, 9, 17, 25 and 30 and BFiN slide 3) were set aside to reduce the impact of technical factors inflating the number of clusters. For each probe, the optimal number of clusters (k) was identified using Bayesian information criterion (BIC). Next, the previously removed samples were added back and clustering was undertaken again using the previously determined value for k. For each multi-modal probe (k > 1), the association between gender, batch, superbatch and cell composition was examined using statistical tests (described in Methods). Clustered methylation can often be due to allele-specific methylation (ASM), so for autosomal CpG probes with 2 or 3 clusters, we examined if the assortment across clusters fit with expectations from population genetics by employing a test for deviation from Hardy–Weinberg equilibrium (HWE). For all the described tests, a stringent FDR-moderated cut-off of p = 0.001 was used for establishing associations.

For the EpiSCOPE 450K data (Additional file 6: Fig. S6), a total of 22,281 CpG probes were found to be clustered, with 19,816 CpG probes having 2 clusters and 2379 and 86 for 3 and 4 clusters, respectively. There were 11,733 (52.7%) of clustered probes in HWE with 6086 (27.3%) measuring CpG methylation at the site of a common SNP and another 1919 (8.6%) having the probed CpG site within 10 bp of a common SNP (SNP-proximal). As expected, often the probes in HWE also overlapped with those found to be SNP-mapping (4786 probes) or SNP-proximal (848 probes). Of the probes in HWE, 4045 could not be linked to a nearby SNP. Another 7337 (32.9%) probes were clustering in association with gender with the great majority of these positioned on the X or Y chromosomes (6595 and 378 probes, respectively). Only 281 (1.3%) of probes clustered due to variability in the immune cell component, while a total of 1477 (6.6%) probes associated with batch and none with superbatch. The association with batch was expected to be low as the cluster discovery stage excluded the most batch-effect prone slides (further details in Methods). A total of 2011 (9.0%) of probes were cross-hybridising probes documented previously [30]. Finally, 1539 clustered probes could not be associated with any of the factors examined.

The distribution for the 80,752 clustered probes in the BFiN EPIC data was 76,041, 4588 and 123 for clusters of 2, 3 and 4, respectively (Additional file 7: Fig. S7). A large proportion were linked to variability in the immune cell component, with 48,762 (60.4%) probes associated. A total of 69,001 (85.4%) clusters were in HWE. We expect this high proportion is not due to a link with ASM but is due to the small number of samples with a high immune cell content. With 48,491 (99.4%) of the cell component-associated probes having two clusters and largely a small number of samples in the second cluster, we can expect that HWE will be upheld as the data will have a distribution similar to that for a rare allele. Consistent with this expectation, almost all probes associated with cellular component are in HWE (48,528 in total). There were 10,703 and 6160 probes at the site of a common SNP or SNP-proximal, respectively, with strong overlap with the clustering SNP-associated probes in the EpiSCOPE data, with 4530 of 6086 (74.4%) common SNP probes and 984 of 1919 (51.3%) SNP-proximal probes also clustering in the BFiN data. A total of 8304 probes in HWE could not linked to a nearby SNP, which is around twice the number in the EpiSCOPE data. For the 10,386 (12.7%) of probes clustering in association with gender, 9723 and 439 probes were on the X or Y chromosomes, respectively, and 5,630 (76.7%) were also associated with gender in the EpiSCOPE data. Only 2 probes associated with batch (slide number), whereas 1481 associated with superbatch. A total of 1813 were cross-hybridising probes, and 476 remaining probes could not be associated with any of the factors considered.

Evaluating batch correction for each probe

While several biological factors were found to be associated with the phenomenon of multi-modal methylation, these factors cannot be used with certainty to identify all erroneously adjusted probes. For example, well-studied metastable epialleles such as nc886 have a robust imprinting-like clustering pattern but are not associated with a proximal SNP, nor parent-of-origin such as with imprinting [20, 31].

To better characterise the biologically relevant clustering set, we employed an empirical technique to evaluate the performance of batch-effect removal algorithms on individual probes. The intention of batch-effect correction is to remove technical variance; it follows that if the process is working correctly, a reduction in the sample variance (or standard deviation) of features should be observed after correction. A simple reduction in total variance cannot be used as a metric, as batch correction may highly disrupt biologically meaningful clustering but will reduce the overall dispersion of the data. Instead, the metric needs to be generalised for probes with either a unimodal or multi-modal distribution. For this purpose, we created a cluster-aware dispersion metric.

For the CpG features which form a multi-modal distribution, partial sums of squares were computed for each individual cluster using the dispersion from the mean for the members of a given cluster. The partial sums of squares were then collated across clusters to form a sample variance (as described in Methods). This modified calculation of sample variance will account for both the removal of assumed technical variance by Harman or ComBat and the preservation of a biologically meaningful clustering pattern. The log-base-2 of the ratio of sample variance precorrection over post-correction (log-variance ratio, LVR) was used as an empirical measure that batch correction was performing on multi-modal probes as intended. An LVR > 0 signifies that variance was inflated for a given probe after the application of the Harman or ComBat methods, whereas an LVR < 0 demonstrates that batch-effect correction is reducing the variance as intended.

We observed that in instances of unintended removal of biologically meaningful methylation clustering (Fig. 7), such as arising from ASM (examples being probes cg25465065 and cg15544633), batch correction reduced the overall dispersion of the data (resulting in a smaller standard deviation) but greatly increased the LVR statistic above 0. However, in cases where the clustering had a biologically meaningful association with gender (such as probe cg00455876), the prior declaration of gender to the batch-effect algorithm as variance to preserve was sufficient to keep the LVR near 0. We also uncovered more complex methylation clustering associations with both gender and genetics (cg15410402). Despite the batch correction algorithm seeking to preserve gender-associated variance, the dual associations led to erroneous correction and an LVR above 0.

Fig. 7
figure 7

Biologically meaningful methylation clustering. Some probes with biologically meaningful methylation were erroneously corrected. Four example probes are illustrated. Each scatter plot compares methylation across slides (X axis) with the methylation β value (Y-axis). The datapoints are from each of the 369 Beadchip arrays, with the data sorted and coloured by slide number. The panel is ordered column-wise from left to right as original, Harman-corrected and ComBat-corrected data. It was observed that the standard deviation (SD) of the data remained the same or less; however, the log-variance ratio (LVR) was elevated considerably above 0. The mean β shift (Shift) is the mean change in β across all the 369 arrays induced by erroneous batch correction. The mapping of common SNPs falling within CpG sites can be used to identify CpG sites which should not be batch corrected. An example of this the probe cg25465065, which has the common C/T SNP rs3768276 positioned at the cytosine and as expected, the frequencies in each cluster are consistent with expectations of the Hardy–Weinberg equilibrium. However, the methylation as measured by probe cg15544633 on chromosome 2 is clustered to two groups: intermediate methylation and no methylation. This clustering is not in Hardy Weinberg equilibrium (p = 9.520 × 10−9), yet the clustering is likely influenced by genetics as the common SNP rs2516834 is immediately adjacent to the assayed CpG site. In the example with the Y chromosomal probe cg00455876, there is clearly a higher methylation state in males and this is clustering is still apparent after batch correction as gender was declared as biological variance to preserve. However, more complex gender associations may arise, in which batch-effect correction performs poorly. One of the alleles for the X chromosomal cg15410402 probe is inactivated in females, but the methylation state in males is complex, with almost half of the males having intermediate methylation and half no methylation. This may well be due to an interaction between gender and genetics, likely due to the influence of the commonly deleted sequence 5’-GGAGCTAGGCCG (rs66532084) 12 bp upstream from the measured CpG site

In instances where a probe had batch-effects removed as intended, both the overall standard deviation (SD) was lower, as well as the LVR falling below 0. Figure 8 provides various examples of the type of batch-effects discovered in the data. These include probes with obvious batch-effects in both the 450K and EPIC data (cg01381374), to those limited predominantly to one of the sets (cg22256960), to slide-based batch-effects in one but showing a positional batch-effect in both (cg27298252), and clustering associated with both gender and batch-effect (cg04294190).

Fig. 8
figure 8

Examples of probes exhibiting obvious batch-effect. In some instances, probes had clear batch-effects which were corrected by application of Harman or ComBat. The panel layout is consistent with that in Fig. 7, with four examples of batch-effect prone probes illustrated. After batch correction, the SD was typically reduced and the computed LVR was considerably less than 0. Typically, as is the case with cg01381374, the particular influence of these probes on the data was idiosyncratic to the dataset. In other instances, there was high technical variance in one dataset but not the other. In the case of cg22256960, the batch-effect is limited to the EPIC superbatch 2 data. The example of cg27298252 highlights that batch-effect can be found both across arrays and by the position in the array. In particular, the EPIC data illustrate clear positional bias. The cg04294190 probe demonstrates that both technical and biological factors can contribute to methylation clustering. In this case, the data are clustered both by gender and within the 450K data, by slide number

The above examples illustrate the need for a discovery-based approach. We found the LVR statistic was particularly useful at identifying CpG sites having biologically relevant clustered methylation that should not be subjected to batch-effect correction (erroneously corrected probes). When applying the LVR statistic to identify probes prone to batch-effect (batch-effect susceptible probes), we found that fully methylated or unmethylated CpG sites often had very small corrections made to them by Harman or ComBat, but given the initial small variance, this resulted in large changes in ratio. The LVR statistic was combined with the mean difference between corrected and uncorrected data to correlate changes in variance and shifts in beta values (Figs. 7, 8). A shift in beta cut-off was introduced to focus towards those probes with a large shrinkage in LVR and an appreciable difference to β after correction.

To classify erroneously corrected and batch-effect susceptible probes, we used a cut-off of 50% variance increase (log2(1.5), LVR = 0.584) or decrease (log2(1/1.5), LVR =  − 0.584), respectively, and mean β shifts of at least 0.01. How these cut-offs interacted with known clustered methylation associations was also characterised (Fig. 9). As anticipated, modal probes with a known common SNP at the measured CpG site often had LVRs much greater than 0 and larger mean β shifts. In the intersection between the 427,274 probes shared across the EpiSCOPE and BFiN datasets, there were 31,449 erroneously corrected and 7,478 batch-effect susceptible probes common to both datasets.

Fig. 9
figure 9

Isolating erroneously corrected and batch-effect susceptible probes via Log-variance ratio and mean β shift. For all probes in the a EpiSCOPE and b BFiN datasets, the change in variance after batch correction (expressed as log-variance ratio) relative to the degree of batch correction (expressed as mean β shift) was plotted. The same data were then highlighted for clustered probes which were modal in distribution. These modal probes were in turn subdivided into probes which were modal and associated with imprinting, a single nucleotide polymorphism at the measured CpG site, cellular component and batch-effect. The vertical lines are plotted at LVR of 0.584 and − 0.584 and the horizontal line at a mean β shift of 0.01

Validation

Analysis of our unconfounded high-quality 450K and EPIC data established two sets of probes: one that was prone to batch-effect and another erroneously modified by batch-effect algorithms. It is useful to further explore the generality of these sets. Inclusion of further data from different studies and geographies may allow discovery of methylation clustering associated with less common SNPs or certain populations, and batch-effect susceptible probes needing significant correction in multiple datasets. To explore the universality of our findings, we searched for additional large 450K and EPIC datasets with these selection criteria: (1) hundreds of samples and an EWAS design, (2) with raw IDAT files in the public domain, having (3) the original Illumina Sentrix identifiers to enable the study of slide and positional effects, with (4) the experimental variable(s) and gender made known, such that biological variance to preserve can be declared to the batch-effect software, also (5) with reasonably subtle phenotype to make technical factors obvious and finally (6) the slide identifier and biological phenotype not being overly confounded.

For the 450K array validation set, we selected an 845 participant (657 women and 188 men) methylation sub-study of peripheral blood as part of the European Prospective Investigation into Cancer and Nutrition (EPIC-Italy), a molecular epidemiology project on diet and cancer [32]. For EPIC array validation, we selected two US originating datasets: 536 EPIC arrays prepared from buccal DNA samples from infants born less than 30 weeks postmenstrual age, as part of the Neonatal Neurobehavior and Outcomes in Very Preterm Infants (NOVI) Study [33], and 392 arrays from cord blood and peripheral blood at 7 years old from 196 children, as part of the Urban Environment and Childhood Asthma (URECA) study [34]. In the NOVI dataset, two arrays were removed: one with a truncated and corrupted IDAT file and another had > 5% failed probes. What remained were data from 235 female and 299 male neonates. The URECA data also had one array removed due to > 5% failed probe count leaving data from a total of 192 female and 199 male children.

The 450K EPIC-Italy cohort was divided across 72 slides. The control probes suggested a reduction in the quality of the data after slide 28, with many slides showing unfavourable control probe intensities across the broad range of sample dependant and independent control sets (Additional file 8: Fig. S8). The NOVI array set was spread across 78 EPIC slides and had good results for the sample independent controls, but high signal in a subset of green negative control probes and a number of outlier fluorescent intensities in the specificity I and II controls. Collectively, this suggests some non-specific extension (Additional file 9: Fig. S9). The URECA data were composed of 49 slides with a well-balanced design, having gender and tissue split evenly across arrays. The control probes generally signified high quality, except for three slides (Additional file 10: Fig. S10).

PCA was used to identify slides subject to batch-effect and the influence of biological factors, such as cell composition. As before, when identifying the optimal k for each CpG probe cluster, the most batch-effect prone slides were excluded for this step. In total, 3, 19 and 25 slides were set aside for cluster identification in the URECA, EPIC-Italy and NOVI datasets, respectively. During batch-effect correction, the biological factors of interest for the EPIC-Italy dataset were specified as gender and neutrophil composition, for NOVI, gender and the proportion of immune component cells and URECA, gender and tissue source (cord or peripheral blood). In each of the three cases, the batch-effect variable specified was the slide identifier.

Universality across datasets

The combination of mean β shift and LVR was used to identify probes shifted substantially by batch correction and also exhibiting appreciable decreases in LVR (batch-effect susceptible) and increases in LVR (erroneously corrected) (Additional file 11: Fig. S11). The same mean β shift of 0.01 and 50% change in LVR cut-offs as the EpiSCOPE and BFiN data were employed.

A total of 427,160 probes were cross-examined across the five datasets; this constitutes 452,453 probes common to all the datasets, less the 25,293 multi-modal distributed probes having a strong association (p = 0.001) with cell composition in one or more of the datasets. This cell composition set was removed as erroneous correction due to unbalanced within-batch sample cell composition is particular to each study and driven by the nature of the biosample. In particular, the BFiN saliva samples and NOVI buccal swabs are a mixture of epigenetically distinct immune and non-immune cells, resulting in a large increase in probes having a multi-modal distribution. While the cell composition differences were modelled and declared to the batch-effect algorithm, this may not be sufficient to prevent erroneous correction in all instances.

Lists of batch-effect susceptible and erroneously corrected probes were generated and compared across datasets. In total, 229,681 (53.8%) of probes are batch-effect susceptible in at least one of the datasets; however, only 15,094 (3.5%) are batch-effect susceptible in at least 4 of 5 datasets and 4,649 (1.1%) in all datasets examined (Table 2). The number of batch-effect susceptible probes unique to one of the datasets was proportional to the size of the dataset and the quality of the data control probe metrics (Additional file 12: Fig. S12). The high-quality BFiN dataset only had 17,922 such probes with 3169 (17.7%) unique to that dataset.

Table 2 Batch-effect susceptible and erroneously corrected probes across datasets

A much smaller set was found for erroneously corrected probes, with 8,755 probes (2.0%) in at least one of the datasets, 1620 (0.4%) erroneously corrected in at least 4 of 5 datasets and 856 (0.2%) in all datasets examined (Table 2, Additional file 13: Fig. S13). Overwhelmingly, the erroneous adjustment of probes is related to multi-modal CpG site methylation arising from the influence of a genetic variant at the CpG site (see SNP-associated multi-modal probes in Additional files 6, 7: Figs. S6, S7). As such, it is expected that the identification of erroneously corrected probes is a function of (1) cohort size, (2) genetic distinctness of the population and (3) quality of the array data.

There were 992 probes unique to URECA, despite it having less than half of the cohort size of EPIC-Italy. The URECA cohort was composed of 75% African American, 20% Hispanic and 4% admixed participants. African ancestry populations are known to have more genetic diversity than non-African populations [35]. The 968 and 571 unique probes in the EpiSCOPE and BFiN cohorts are high given the cohort size. There were also another 1,490 probes shared exclusively between EpiSCOPE and EPIC-Italy (Additional file 13: Fig. S13).

The NOVI cohort is relatively large (n = 534) and has a multi-ethnic genetic background with 22% African American, 7% Asian and 7% Hawaiian or Pacific Islander background. However, the number of unique probes (144) was the lowest of all the five cohorts (Additional file 13: Fig. S13). We posit this is due to data quality and use of buccal cell biosamples, with the large variances in the data masking some genetic effects on the data. In support of this notion, we note that NOVI had the highest number of probes (359) not shared with the four other cohorts. This can be interpreted as the clusters being harder to identify.

Finally, we note that of 856 probes common to all the datasets, according to dbSNP v151, 768 (89.7%) of these are directly at the site of a SNP and another 20 (2.3%) have the measured CpG site within 10 bp of a SNP.

Lists of troublesome probes

With large EWAS designs, we suggest that beta values should be characterised empirically for clustering and for changes in variance before and after batch correction using the LVR statistic. However, this approach is not suitable for small studies as sufficient arrays are required to identify clusters. The need for larger datasets is magnified for ASM-associated clustering involving SNPs with small minor allele frequencies. Therefore, we provide reference matrices containing the LVR statistic and mean β differences (rounded to 4 decimal places) for all the arrays examined in this study (Additional file 15: Table S1). This list comprises data from 1214 450K and 1094 EPIC arrays from regionally diverse and multi-ethic populations across Australia, the USA and Italy and spanning multiple commonly collected biosamples (blood, buccal cells and saliva). This reference matrix will allow investigators to identify erroneously corrected and batch-effect susceptible CpG probes in their study, particularly for rarer SNPs in the population under inspection. For ease of use, this matrix is also distributed as part of the HarmanData package available on Bioconductor and can be called directly into an R session.

The nature of probes more subject to batch-effects

Four factors were considered for batch-effect susceptibility: cross-hybridisation due to highly homologous sequences, Infinium design (Type I or II), the number of CpG sites internal to the probe and probe melting temperature (Tm).

Previously, Chen et al. identified a list of 29,233 probes in the 450K design which are highly homologous with 47–50 bases matching to a cross-reactive target [30]. Often this is a result of probes targeting repetitive genomic sequences or genes that have pseudogenes or homologous genes [36]. In a related exercise, Benton et al. mapped probe sequences to the human genome using BOWTIE2 [37] and identified 33,457 probes aligning greater than once [38]. There is a large overlap between the Chen and Benton set, with 21,361 probes shared between the sets. Both of these were overlapped with the sets of probes considered as batch-effect susceptible (Table 2). While these sets were minor contributors to the total number of probes batch-effect susceptible in 1–3 datasets, cross-hybridisation and multi-mapping were a large factor in probes batch-effect susceptible in 4 or 5 datasets and constituted 39.7% and 35.4%, respectively, of the 4649 probes found in all datasets.

Infinium design also held some influence. Type I and II probes constituted 1748 (37.6%) and 2901 (62.4%), respectively, of the probes universally considered batch-effect susceptible, whereas they were 27.4% and 72.6% of the total set of 427,160 probes. There is a significant bias towards Type II probes being batch-effect susceptible in all sets (Fisher exact p < 2.2e−16). Interestingly, the sets were enriched for those Type I probes having 2 or less internal CpG sites and Type II probes having no internal CpG sites (Table 2).

A relationship between sensitivity to a batch-effect and probe melting temperature (Tm) could also be found for the EpiSCOPE and NOVI datasets in particular (Fig. 10). While most probes have an in silico determined Tm of ~ 75 °C, the 7,961 probes (from the set of 427,160) with a Tm of less than 70 °C show high between-batch correction. The limitation to the EpiSCOPE and NOVI datasets might be due to factors such as a difference in salt concentration in the hybridisation buffer, the oven set temperature or fluctuations from that, or the length of hybridisation time. Interestingly, for the EpiSCOPE set, the sets requiring the most correction were the first set processed each day (communication with service provider).

Fig. 10
figure 10

The influence of probe melting temperatures on batch-effect. For each of the five studies and two probe types (I and II), the relationship between probe oligonucleotide melting temperature (Tm) and batch correction (mean β shift) was examined. A subset of Type II probes in the EpiSCOPE and NOVI study data were observed to require more batch correction when the probe Tm is low

In summary, probe batch-effect susceptibility is multi-factorial in nature and was strongly associated with all the four factors examined. We note the low Tm set is largely different to the cross-hybridisation and multi-mapping sets, with only 198, 98 and 414 probes shared between the Benton, Chen and all sets, respectively. The set of batch-effect susceptible probes described here has overlap with the previously published sets described by Chen et al. and Benton et al. but is discrete.

Findings of false positive probes in published EWAS studies

To gauge the impact in the literature of batch-effects on the outcomes of EWAS studies, a search was conducted across PubMed for appearances of the 4649 probes identified as batch-effect prone in all the 5 studies examined here. The R library easyPubMed (2.13) was used to query titles and abstracts in PubMed [39] for the term ‘EWAS’ in conjunction with one of these probe identifiers. In total, 3 studies were found.

In an EWAS considering Parkinson’s disease, Moore et al. [40] listed cg11963436 as one of the top-12 loci and interestingly found the locus to also replicate using Sequenom EpiTYPER for 219 individuals. We observe this probe to show profound batch-effects in the EpiSCOPE and URECA data (Additional file 14: Fig. S14). Another EWAS from Chen et al. [41] considering reduced kidney glomerular filtration rate in 567 HIV-positive and 117 HIV-negative men describe cg18368637 as one of the top-3 loci. For this probe, there are obvious batch-effects across the 5 datasets we examined (Additional file 14: Fig. S14). Finally, in a recently published re-analysis of existing gestational diabetes mellitus (GDM) EWAS data, Liu et al. [42] found cg22385669 to be one of 62 significant CpG methylation sites and 1 of the 6 probes in their SVM model predicting GDM occurrence. This probe has a more subtle batch-effect than the previous two examples but is still readily apparent (Additional file 14: Fig. S14).

All three examples had batch-effects readily observable via visual examination in the 5 datasets considered here. This illustrates the importance of plotting β ordered and coloured by slide number as a diagnostic to find potentially spurious associations.

Read more here: Source link