The removal of problematic regions
The removal of problematic genomic regions is considered essential for the accurate analysis of data obtained by chromatin immunoprecipitation followed by genome sequencing (ChIP-Seq) [27, 35]. Repetitive regions including satellite DNA arrays comprise a majority of such problematic regions, mainly because they reside in the unassembled part of the reference genome, so their actual sequence may be collapsed. The consequence of this is that the sequencing reads accumulate at such regions and the signal is interpreted as much higher than it actually is, which leads not only to false-positive peaks, but also to erroneous normalization of signal between samples . We identified annotated satellite DNA instances that overlapped the coordinates of problematic regions compiled by the ENCODE project , which we refer to as the blocklist throughout the manuscript [36, 37]. About 50% of satellite instances for hg19 assembly and 0.05% for mm10 assembly overlapped their corresponding blocklists (see, Additional file 2: Table S3 and S4). These regions are expected to have a high ratio of sequencing reads that cannot be uniquely mapped to a single genomic position, due to the repetitive nature of underlying identical or nearly identical genomic sequences. These reads are assigned zero mapping quality during alignment to the reference genome. To estimate their content at satellite elements on and outside of the blocklist, we analyzed read alignments of two arbitrarily chosen cell lines, HMEC and A549 (Additional file 2: Text S2 and Figure S1). The results suggest that the H3K9me3 signal at satellite elements that did not intersect any component on the blocklist is less likely influenced by ambiguously mapped reads. Therefore, only these satellite elements were considered for further analyses.
Reduced H3K9me3 level at satellite DNA repeats in cancer cell lines relative to normal cells
To examine the relative enrichment of H3K9me3 at annotated satellite elements in the human genome, we analyzed existing maps generated by the Broad/MGH and the Stanford/Yale/USC/Harvard ENCODE groups using ChIP-Seq in different cell lines, derived from various tissues and corresponding to normal or cancer karyotype. As starting data, we used density graphs of signal enrichment based on aligned read density. After normalizing for the differences in sequencing coverage between samples (see “Methods” section), the average signal was computed for each satellite element. The enrichment of H3K9me3 was calculated as fold change (FC) of such signal in ChIP experiment to the input control.
A large majority of satellite elements are considered enriched for H3K9me3. Only 1.2% (53) elements have log2 FC below 0.3 (corresponding to FC < 1.23) in all 21 cell lines, which can be considered not to be methylated at all or just slightly methylated. 94% of elements (4,143) have > = 0.58 log2 FC (FC > = 1.5) with up to 4 (FC = 16; Additional file 3: Table S5). The average log2 FC for all 4,406 annotated satellite elements outside of the blocklist is 0.61 (corresponding to FC of 1.5); 0.63 for normal cells and 0.57 for tumor cells (Additional file 2: Figure S2). Satellite families which show the highest overall level of H3K9me3 are ACRO1, (CACTT)n and (GAATG)n [38, 39], while GSAT, GSATX, and GSATII [40,41,42] are enriched for H3K9me3 in normal cell lines but not in cancer (Additional file 2: Figure S3). There is a slightly higher H3K9me3 level at (CACTT)n and ALR/Alpha repeats [43, 44] in cancer cells, but the overall data suggest lower methylation at satellite repeats in cancer compared to normal cell lines (Additional file 2: Figures S2 and S3).
Epigenetic changes are frequently associated with cancer , hence it can also be expected that H3K9me3 state may be disrupted at satellite loci in situations of abnormal development. To check if cancer cells show different patterns of H3K9me3 at non-centromeric satellite loci compared to normal, we performed principal component analysis (PCA) based on FC values at autosomal satellite instances. PCA plot showed clustering by karyotype (Fig. 1a), albeit with a borderline significance (p-value = 0.033; PERMANOVA), whereas clustering by tissue lineage and sex was not observed (Additional file 2: Figure S4).
To identify satellite elements that show differences in the level of H3K9me3 between normal and cancer cell lines, we performed a two-sample t-test and found 407 satellite elements that are differentially enriched for H3K9me3 mark between the two groups. After performing adjustment of p-values for multiple testing by Bonferroni, none of these comparisons came out significant. However, it has been suggested that corrections for multiple testing in ChIP-Seq peak calling may lead to an underestimation of the number of many regions that show substantial enrichment that may be biologically relevant . We reasoned that this may apply to at least a subset of these elements, especially because H3K9me3 mark is of broad type, i.e., no concrete, clearly defined peaks are expected. We, therefore, decided to continue the subsequent analyses with all of the 407 elements, conscientiously having in mind the implications of such a decision on the interpretation of results and conclusions.
The majority of satellite elements that are differentially enriched for H3K9me3 are located on autosomes (382), 5 elements are on chrX and 20 elements are found to be significantly differently enriched for H3K9m3 between male derived cells on chrX and chrY (Fig. 1b and Additional file 4: Table S6). For elements with significant differences between normal and cancer, average log2 FC was threefold higher in normal cells (0.89; median 0.82) than in cancer cells (0.31; median 0.10; Fig. 1c). SYDH cell lines appear to be more undermethylated than Broad cells (Additional File 4: Table S6). Although these may be biologically relevant differences, it cannot be excluded that they arise from different handling during the experiments in the two groups which generated the samples. We used the circos program  to visualize positions of annotated satellite elements in the human genome (Additional file 2: Figure S5). Expectedly, those that are on the blocklist are located mainly at pericentromeres and near telomeres, while a large number of elements suitable for the analysis is scattered throughout the whole chromosomes. Satellite elements that show differential enrichment between cancer and normal cells seem to be randomly distributed on all chromosomes, with the exception of several clusters, located mostly at pericentromeres. Some of these clusters contain a large number of elements, such as the ~ 115 kb cluster on chr12 composed of 134 annotated GSATII elements, 120 of which show differential H3K9me3 enrichment between normal and cancer cell lines (11 elements annotated within this cluster were excluded from differential enrichment analysis due to their overlap with blocklisted regions). The whole cluster is generally enriched for H3K9me3 in normal cell lines (with an exception of PBMC), in contrast with low H3K9me3 level in analyzed cancer cells apart from Dnd41 (Fig. 2). In some cell lines, the H3K9me3 enrichment or the lack of it is limited to the satellite cluster region, such as in GM12878, NHA, NHEK, NT2D1, A549 and HepG2, and in some others, it spreads out to the neighboring regions, such as in H1-HESC, monocytes-CD14 + , HUVEC, Dnd41, and K562.
We questioned whether the loss of H3K9me3 repressive chromatin mark in cancer cells is accompanied by the accumulation of an activating histone modification and analyzed the level of H3K4me1 as one of the typical marks of euchromatin, at satellite DNA elements that show differential enrichment of H3K9me3 between normal and cancer cell lines. We find that these elements are generally not enriched in H3K4me1 (median of log2 FC for each group is < 0.0001; Additional file 2: Figure S6A). Although some elements are characterized by a higher H3K4me1 level, there is no general trend that would be specific to a group of cell lines (Additional file 2: Figure S6B).
Pericentromeric heterochromatin is generally characterized by a decrease of silent histone mark H3K9me3 in cancers relative to normal cells [6, 16]. The larger enrichment of H3K9me3 that we observe in normal cells may simply reflect their larger enrichment at (peri)centromeres, such that they arise from an accumulation of reads that match to multiple positions. To check if this is the case, we analyzed the fraction of zero mapping quality reads aligned to the 407 satellite elements that show differential FCs between cancer and normal cells. About 5.5% of satellite elements contain exclusively zero mapping quality reads, but they are supported by less than 6 reads on average, compared to 32 reads on average for the rest of the elements. About 85% of elements actually contain less than a quarter of ambiguously mapped reads (Fig. 1d). This suggests that identification of differentially enriched H3K9me3 elements is not affected by reads that align to multiple positions such as highly repetitive (peri)centromeric regions.
To explore if derived H3K9me3 enrichment reflects nucleosome occupancy at satellite loci, we analyzed the only MNase-Seq data based on hg19 assembly that was publicly available for the cell lines in this study, i.e., K562 and GM12878. We calculated the average MNase-Seq signal per each satellite element and compared their distribution with the distribution of the calculated average signal on permuted coordinates based on data from K562 (Additional file 2: Figure S7A). The results suggest that nucleosome occupancy at analyzed satellite elements resembles average occupancy throughout the genome. We found no correlation between average MNase-Seq signal and average FC (Additional file 2: Figure S7B), suggesting that the derived H3K9me3 enrichment is not a reflection of nucleosome occupancy in the two analyzed cell lines.
H3K9me3 signal is not restricted to satellite DNA
Because broad histone marks such as H3K9me3 have no clearly defined peak summits and the signal is generally flat compared to other marks, the pattern of enrichment is better described as “domains” than “peaks” [26, 48]. However, we reasoned that, if some domains of signal enrichment are localized to satellite element(s), these elements may be subject to regulation or hold a regulatory function. In order to estimate what proportion of satellite loci with H3K9me3 enrichment have signal limited to satellite element, we compared their coordinates with coordinates of H3K9me3 peaks (StdPk files) as called by the Broad and SYDH ENCODE groups. There were on average 35 (stdev = 13) elements per cell line that reciprocally overlap at least 50% of length with coordinates of called peaks. However, this number declines to 17 (stdev = 9) when only loci with signal FC > = 2 are considered. This translates into 399 unique satellite elements regardless of their FC or 213 of those with FC > = 2. Hence, when only elements with FC > = 2 are considered, on average and at best, less than 2% of satellite elements per cell line can be considered to have peaks specific to satellite elements (Additional file 2: Table S7). Such low proportion obtained by applying loose criteria for reciprocal overlap suggests that the satellite elements are part of larger domains of H3K9me3 enrichment. The visualization of FC at a subset of satellite regions that show differential H3K9me3 enrichment between cancer and normal cell lines further corroborates this (Additional file 2: Figure S8). It also shows that the H3K9me3 enrichment does not generally match locations of previously called peaks. This lack of concordance can be partially explained by differences in the signal normalization strategy and the fact that the cutoff value was applied for peak calling. However, as shown in other analyses here, there is a clear reduction of H3K9me3 enrichment at non-centromeric satellite elements in cancer cell lines compared to normal cell lines.
Transcription is not correlated with H3K9me3 level
It has been established that the trimethylation of H3K9 is associated with transcriptional repression . Given that cancer cells show reduced H3K9me3 at a subset of satellite loci, we checked if they also have increased transcription of these regions compared to the normal cells. We analyzed RNA signal for all 4406 satellite loci outside of the blocklist, for cells in which RNA-Seq data were available (see “Methods” section). Overall, transcription evidence was found at 30% (1326) of the loci for long polyadenylated RNA and 36% (1567) for long non-polyadenylated RNA, and much lower evidence of transcription is seen for small RNA, i.e., at only 3.6% (158) of satellite DNA loci overall. We overlapped coordinates of satellite elements with RNA contig coordinates and calculated transcript level at each satellite locus as average BPKM or RPKM. There were no RNA contigs that fully overlapped with satellite elements, i.e., satellite elements are parts of longer transcripts that start and/or end outside of the satellite sequence. Cumulative distribution showed no substantial difference in the transcript level at satellite loci between cancer cell lines and normal cell lines (Fig. 3). We also found no significant difference at any of the satellite loci that show differential enrichment of H3K9me3 between the two groups.
We analyzed the chromatin landscape at satellite elements based on genome segmentation of the cell lines for which the data were previously produced . Expectedly, the satellite DNA elements are found to be enriched for repressed chromatin state, and the active state was also underrepresented (compared to genome average) at elements that were identified to be differentially enriched for H3K9me3 between cancer and normal cells (Additional file 2: Figure S9).
Stage-specific enrichment of H3K9me3 at annotated satellite DNA elements during fetal development in mouse
Mouse embryogenesis is characterized by the dramatic remodeling of constitutive heterochromatin which is essential for the development and epigenetic reprogramming [2, 19] and genome-wide profiling of H3K9me3 in early mouse embryos revealed distinct H3K9me3 dynamics in promoters and long terminal repeats (LTRs) . In the mouse reference genome, there are in total 28,937 annotated satellite elements outside of the problematic genomic regions. Given their abundance, some may hold a regulatory potential exerted by the H3K9me3 level. In order to explore this idea, we analyzed publicly available ChIP-Seq data in multiple tissues during fetal development in mouse produced by the UCSD/Ren ENCODE group . We found that the H3K9me3 is generally enriched at annotated satellite DNA instances in the mouse genome (average fold change per developmental stage 1.6; standard deviation 0.1). This enrichment only slightly varies across analyzed stages: the lowest (1.5) average enrichment (fold change of signal over input calculated across all annotated elements) is in samples collected at birth, and the highest (1.8) in samples analyzed at 12 days after conception.
PCA based on FC values at autosomal elements showed significant clustering of samples by developmental stage (p-value < 0.001; PERMANOVA; Fig. 4a) and not by tissue (p-value = 0.253; Additional file 2: Figure S10). Analysis of variance followed by post hoc test identified 864 satellite elements with significant and biologically meaningful (see “Methods” section) differential enrichment of H3K9me3 between at least two developmental stages (Fig. 4b and Additional file 5: Table S8).
The largest number of elements showing major differences was found between 10 days after conception and all other analyzed stages (Additional file 2: Figure S11). This set of elements shows dynamic changes in H3K9me3 level during the fetal development: the H3K9me3 enrichment is highest at 10 days after conception, drops down significantly in one day older fetuses, increases to some extent in 12-, 13- and 15-days old fetuses to become again substantially reduced before and at birth. This dynamics is in contrast with the more or less constant global level of H3K9me3 at all annotated satellite DNA sequences during development (Fig. 4c). No clustering of elements into distinct groups based on their differences in the H3K9me3 level was observed (Fig. 4b). MMSAT4 family  was over-represented in this set with 5.8-fold more elements than expected given the number of its annotated instances in the whole genome (hypergeometric test, p-value < 2.5e−135), although there were large variations in the proportion of satellite families when individual pairwise comparisons are considered (Additional file 2: Figure S12). This over-representation can be explained by the physical proximity of multiple elements, localized within a larger region that is differentially methylated at H3K9 during fetal development. Indeed, the genomic distribution of elements that show differential H3K9me3 enrichment roughly correlates with that of all annotated satellite elements in the mouse genome (Additional file 2: Figure S13).
Satellite DNA units that are supported by ambiguously mapping reads may form a significant fraction of the functional heterochromatic sequence. We sought to analyze H3K9me3 levels on these repeats to see how they differ from satellite elements in unique genomic regions during mouse development. For this purpose, we started from the unfiltered alignment files which contained ambiguously mapped reads. After duplicate read removal and removal of unmapped reads, all alignment files were subsampled to the same number of reads, to account for the differences in sequencing depth between samples. Reads that mapped to multiple locations were extracted and re-mapped to a reference that consisted of satellite DNA consensus sequences. Lower stringency options were used for alignment, to account for the mismatches with reference consensus sequence, i.e., natural variation in monomer sequence between individual repeats. Finally, H3K9me3 enrichment was calculated as a ratio between the total number of aligned reads in a ChIP experiment and the corresponding input sample. Expectedly, the vast majority of reads aligned to major and minor mouse satellite DNA, GSAT_MM and SYNREP_MM, respectively. We find that the H3K9me3 at major and minor mouse satellite shows generally similar dynamics as shown for satellites in unique regions (compare Additional file 2: Figure S14, with red dataset in Fig. 4c). Apart from IMPB_01 satellite, there were no reads aligned to other five satellite DNA families, suggesting their location to be entirely in unique genomic regions. Although many reads were aligned to IMPB_01 consensus, the coverage was strongly non-uniform, with reads almost exclusively piled up onto low complexity dinucleotide sequences. According to a Repbase report, a repetitive pattern of IMPB_01 is present on mouse chromosome 16 and consists of three conservative parts separated by low complexity insertions of variable length . Although these elements are annotated as satellite DNAs at many instances throughout mouse mm10 genome assembly, to the best of our knowledge there is no published study that investigates sequence and distribution of IMPB_01 repeats.
Unlike human satellite repeats, over 50% of satellite repeats in mouse are found within 10-kb distance of genes . Furthermore, mouse satellite repeats are strongly enriched in the CDS of a specific group of protein-coding genes and implicated in their function, regulation and expression. In order to examine the potential role of H3K9me3 in such processes, we analyzed if gene-overlapping satellite elements are enriched in the set of satellite elements that show significant differences in the H3K9me3 during mouse fetal development. Among 864 elements, 54% overlap genes (470). This proportion is similar to the proportion of elements associated with genes when all annotated instances outside of problematic regions are considered (14,442/28,937). The presence of H3K9me3 within a subset of genes has been revealed before for mouse cells  and here we show that about half of the genes associated with satellite DNA sequences show differentially enriched H3K9me3 during fetal development.
Read more here: Source link