Comparison of read level and improving the mapping efficiency according to trimming
Since the generation of high-quality WGBS data ultimately impacts the quantification and interpretation of Cs methylation levels, it is indispensable to monitor the raw data quality and interrogate the appropriate pre-processing step to cleanse data [1]. To avoid biased results by different sample preparation and library construction processes, we split the same library (Additional file 2: Fig. S1a and b) into two copies for two sequencing platforms. Finally, we obtained eight datasets as in Fig. 1. Based on an average of 321 bp length for two h293 libraries and 327 bp length for two mouse libraries, we generated 204.94 ± 43 (mean ± standard deviation (SD))million reads for the four h293 samples and 64.91 ± 12.01 million reads for the four mouse samples (Additional file 3: Table S1), of which the minimum value of Q30 was 90%. Even though high quality of raw data was obtained, due to the result of quality scores (Additional file 4: Fig. S2) and base distribution (Fig. 2b) using FastQC in the pre-processing step, we found that the first ten bases of read2 and the end ten bases of read1 showed relatively low quality. This phenomenon presumably resulted from the PBAT protocol, which suffered from a higher percentage of low-quality bases along the start of reads [39], or maybe because of the inadequacy of those sequencing instruments, requiring more detailed research in the follow-up research work [44]. Thus, we set various cutoffs, including 10 bp, 20 bp, 30 bp, 40 bp, and 50 bp (Fig. 2a) for the trimming stage and detected the mapping ratio of four h293 samples from the BSMAP mapping tool. The results determined that the proper number of the trimmed base from the 5′ of read2 and the end of read1 was 10 bp, which produced most of the viable amount of usable bases (of which, the minimum value of Q30 was up to 90.41%) and increased the mapping efficiency from 60 to 80% on unique mapping ratio (Fig. 2a; Additional file 4: Fig. S2). We loaded those cleansing datasets into BCREval [45] software to estimate the bisulfite conversion of those four libraries. The result showed that the conversion ratio was 96.685% for h293_s1, 95.745% for h293_s2, 93.97% for mouse_s1, and 93.41% for mouse_s2, respectively. That may result in false positive outcomes, such as a high level of non-CG methylation status. One particular challenge in carrying out WGBS is the base bias of the library. Therefore, we pooled ~ 40% PhiX spike-in, a substantial spike-in of DNA of a balanced base composition, to enhance sequencing quality. The result of base bias (Additional file 5: Fig. S3) showed the high-quality curated data. Besides, the relative decline of quality near the end of reads occurred on both sequencers. That may have been caused by the accumulation of phasing and prephasing errors in the PE150 sequencing. Based on this trimmed cutoff, we observed that the percentage of T base in whole reads from two platforms showed higher consistency and unbiasedness in PBAT step from another side (Additional file 6: Fig. S6a).
Using the QC-passed reads to align reference, we found that the four h293 data spread a balancing coverage, depth, and GC%, the same as mouse samples (Fig. 2c; Additional file 6: Fig. S6b). After the mapping stage using the BSMAP aligner, the data from GenoLab M reached a more robust depth (35x) [1] than that from NovaSeq, ranging from 28× to 33×, no matter in genome-wide or at chromosome-level (Fig. 2c; Additional file 6: Fig. S6c and d). The depth and coverage of data belonging to mouse samples showed consistency in the two platforms. The two replicates per platform and the average depth and coverage satisfied the modest sensitivity and specificity of standard WGBS [46].
Comparison of those mapping performance and methylation conversion
Apart from the data qualification, the mapping ratio was the second challenge mainly affected by computational methods and would impact the final methylation calls [27]. Herein, we then focused on comparing the mapping ratio and subsequently methylation conversion from the perspective of the platforms as well as data processing pipelines. From the platform’s perspective, it revealed that the alignment ratio was relatively higher in two h293 curated data from the NovaSeq platform, which was similar in four mouse curated data (Fig. 3a and c). The proportion of duplicated reads was much higher on NovaSeq (mean 14.94%) than that on GenoLab M (mean 3.05%) (Fig. 3a and c). In the comparison of software, we found that the alignment percentage for BSBolt was the highest and was superior to 94% on four h293 samples, and to 91% on four mouse samples. The following was BetMeth2, which could obtain at least 91% unique mapping reads on four h293 samples, and 86% unique mapping reads on four mouse samples. The performance of the BS Seeker2 aligner was the worst. The Bismark could find more duplicated reads on data from NovaSeq (mean 14.9%) than the other four sorts of software.
Additionally, multiple reads uncovering each methylated cytosine could be used as a readout of the fragmentation of the sequences within the sample that are methylated at that location, here represented as the methylation level of a specific cytosine. Therefore, we observed differences in methylation levels of the specific cytosine between the sequencing platforms and among the software of methylation calls. The detectable genome-wide Cs mainly contain non-CG context (CHG and CHH, where H = A, C, or T) and CG context (CpG sites). The methylated Cs percentile of curated data from the two platforms showed high conformity in both h293 and mouse samples. In the four sample reads (two h293 samples and two mouse samples) on GenoLab M, the methylated Cs of CpG (mCG) sites were significantly more than the methylated Cs of CHG (mCHG) and CHH (mCHH) (Fig. 3b and d). This trend of methylated Cs was similar in four samples on NovaSeq, indicating that at sites of non-CG methylation only a fraction of the surveyed genomes in those samples were methylated and data from the two platforms were entirely consistent. However, analysis of these loci from four software showed that BSMAP and BatMeth2 could reveal much more methylated Cs (~ 70–80%) than the remaining methods (~ 65%). This quantitative difference was much smaller than the unique alignment ratio.
Assessment of the agreement between DNA methylation landscapes generated by pipelines versus that from two platforms
After strict curating the pre-processing and processing step, we evaluated the performance of two platforms for DNA methylation quantification and five processing pipelines for DNA methylation calls. In the genome of mammals, DNA methylation occurs mainly at cytosine followed by guanine, namely CG methylation (mCG) profile. Contrastingly, methylation status at cytosines followed by bases other than guanine is referred to as non-CG methylation (mCHG, mCHH, where H = A, C, or T) profiles. For two platforms, we focused on those three methylated Cs profiles (mCG, mCHG, mCHH) throughout each autosome and chrX. Of the detected Cs site in the CpG context, the density profile of mCG displayed consistency throughout each autosome and chrX from a global-scale view. On the contrary, the density profile of mCHG and mCHH showed more variations across each chromosome (Fig. 4a). Considering the testing mCs in all and the CpG context, we found that the GenoLab M platform calls considerably higher mCs relative to the NovaSeq platform (Fig. 4d). The common quantified mCpG sites accounted for ~ 50%.
For those processing pipelines, the profiled all Cs sites in the CpG context of the human genome showed that the Bismark could detect the most CG context, up to approximately 59 million, containing almost 99% CG context from the other four pipelines (Fig. 4c) and similar with the previous amount [4]. The density profile of mCpG was highly in agreement with data metrics from five processing pipelines. Conversely, the density profiles of methylated non-CG context from BSMAP and Bismark were similar and more noticeable than that from BSBolt, BatMeth2 and BSSeeker2. As a result of BSMAP, the density profile of methylated non-CG context on two platforms show more consistency from two replicates, which is vice versus Bismark. The BSSeeker2 uncovered the lowest density profiles of mCHG and mCHH among all software (Fig. 4a). For mCG loci, the correlation analysis of biological replicates and different approaches showed that BSMAP performed the best between replicates and platforms (Additional file 7: Fig. S4a).
To characterize the similarity of mCG and the clusters of those biological samples from all approaches, we performed the principal component analysis (PCA) using all mCG data. We generally observed a higher correlation between data from BSMAP and Bismark (Fig. 4b), a similar trend to the genomic-wide view of DNA methylation patterns (Fig. 4a). Among this result, two replicates on GenoLab M showed more similarity than two on NovaSeq from Bismark. Moreover, two replicates on NovaSeq showed more similarity than two on GenoLab M from BSMAP. For combined mCHG loci, the PCA analysis showed that the mCHG is more complicated and inconsistent among all pipelines and platforms (Additional file 7: Fig. S4b).
Hereafter, we combined two replicates to represent each platform and implemented the correlation analysis through pairwise comparison of mCG to excavate the concordance of data on two platforms from five approaches. We calculated the value of Jaccard similarity between pairs, and showed those correlation values via a heatmap plot (Fig. 4e). We observed that the data from BSMAP, BS-Seeker2, and Bismark showed more consistency, especially from BSMAP and BS-Seeker2. Furthermore, the data on two platforms from BSSeeker2 showed the most consistency. Intriguingly, that condition was observed in the analysis of identifying the common mCG sites and mCs sites according to the post-filtering mCG sites (depth per site ≥ 4) (Fig. 4d), which obtained the most common mCG sites and mCs sites up to 68%. However, a lower correlation was observed between mCG sites on two platforms from BatMeth2 and BSBolt (Fig. 4e). Due to the analysis of identifying the common mCG sites and mCs sites, we observed that of the methylcytosine detected in h293 on GenoLab M from BSMAP and BatMeth2, up to 88% were obtained in the CG context, and the total number of mCG sites was lower up to 84% on NovaSeq. That was a similar proportion between the two platforms from Bismark. For all pipelines, the common mC loci and mCG loci were lower in balance, and those were slightly increasing only for BSMAP and BSSeeker2, which was consistent with the result of correlation analysis (Fig. 4d and e). We further compared the paradigm of mCG methylation, which was validated by previous research [40], with our results of h293 from two platforms and four pipelines. The graph (Additional file 7: Fig. S4c) showed roughly similar methylation levels throughout each autosome and chrX.
Apart from the results’ concordance of the mappers based on the two platforms above, we simultaneously recorded the computational time on each procedure, including building reference index, pre-processing, alignment, post-processing, methylation calling, the convenience of use, and total time based on the same operating system configuration (Ubuntu v20.04 operating system, containing 2 physical CPUs, 48 logical CPUs, and 377G RAM size). We observed that BS-Seeker2 took the longest time in reference index, alignment, and methylation calls, whose time ranged from fourfold to tenfold longer than other pipelines. On the contrary, the BSBolt took the shortest time for the whole five procedures, which was only ~ 10 h on analysis of h293, following with BSMAP (~ 11 h) (Table 1; Additional file 8: Table S2).
Consistency of methylated pattern of specific CG sites among data from two platforms and previous studies
Despite the general concordance of the mapping outcomes, the results of the methylated calling demonstrated that BSMAP provided attractive robustness between replicates and platforms. Consequently, we further dissected the paradigm of mCs’ and mCGs’ methylation over genomic annotations from the result of BSMAP. Methylation rates plotted over GRCh38.p14 genomic annotation database was generated by aggregating all mCs methylation and mCG methylation fractions in percentile windows for 5000 bp upstream of the gene TSS, through the geneset, and the 5000 bp downstream of the gene TES. The result (Fig. 5a and b) revealed the same feature as the previous study [4].
We simultaneously compared the specific CpG of two genes in the human embryonic kidney cell line, which was validated by previous research [39], among our methylation results from BSMAP and those known results from NCBI [40,41,42]. The bubble plot showed highly consistent methylation patterns within all five datasets, indicating the accuracy of mCG sites on two platforms (Fig. 5c; Additional file 9: Fig. S5).
Read more here: Source link