Subtype and cell type specific expression of lncRNAs provide insight into breast cancer

lncRNA expression according to breast cancer clinicopathological subtypes

To identify lncRNAs expressed by specific breast cancer subtypes or associated with clinicopathological features, we analyzed RNA-sequencing data from two large independent breast cancer cohorts: SCAN-B (n = 3455)17 and TCGA-BRCA (n = 1095).

We focused on lncRNAs annotated in the Ensembl18 v93 non-coding reference transcriptome (Supplementary Fig. 1), and identified 4108 lncRNAs present in both cohorts, which are further analyzed in this study. A small number of lncRNAs (100 in SCAN-B, 37 in TCGA) were expressed >1 transcript per million (TPM) in all patients, but the majority of lncRNAs were expressed at lower levels. The two cohorts differ in the distribution of patients expressing lncRNAs>1TPM (Supplementary Fig. 2). Such sparsity of the lncRNA expression in the two cohorts highlights the importance of analyzing at least two independent breast cancer cohorts to robustly identify the lncRNA associated with clinicopathological features. Hierarchical clustering of the log2 expression values of the 4108 lncRNAs clearly grouped ER positive from ER negative patients, (Fig. 1a: SCAN-B and Fig. 1b: TCGA), indicating an association between breast cancer subtypes and lncRNA expression.

Fig. 1: lncRNA expression in breast cancer subtypes.
figure 1

Hierarchical clustering of log2(TPM + 1) of 4108 lncRNAs expressed above filtering thresholds (see Methods) in the SCAN-B (a), and TCGA-BRCA (b) cohorts. Estrogen Receptor (ER) and Her2 status, as well as PAM50 subtypes are annotated at the top of the heatmap. The expression gradient (blue to red) represents scaled and centered log2(TPM + 1). ce Dot plot of the log Fold Change (FC) from the differential expression analysis using a fitted Limma model (lmfit) and moderated t-statistic (eBayes) between patients of different subtypes in SCAN-B (x-axis) and TCGA-BRCA (y-axis). Each dot represents a lncRNA, while the colour indicates the subtype with the highest expression c ER positive (blue) and ER negative (red), d Her2 negative (dark blue) and Her2 positive (pink). e Luminal A (dark blue), and Luminal B (light blue). Gray dots are lncRNAs that are not significantly differentially expressed, while black dots represent lncRNAs with opposite fold change (FC) in the two cohorts. The number of patients in each clinical group were as follows: ER positive (n = 2409 and n = 807), ER negative (n = 504 and n = 237), Her2 positive (n = 458 and n = 114), Her2 negative (n = 2845 and n = 650), Luminal A (n = 1769 and n = 562), and Luminal B (n = 766 and n = 209) in SCAN-B and TCGA-BRCA respectively.

To further identify the lncRNAs associated with breast cancer subtypes, we performed differential expression analysis using linear modeling and empirical Bayes moderation. We report lncRNAs with significant differential expression according to the ER status (Fig. 1c) and HER2 status (Fig. 1d). The significant Pearson correlation between the log fold change (FC) in the SCAN-B and the TCGA cohorts (r = 0.93, p-value < 2.2e-16: ER status and r = 0.75, p-value < 2.2e-16: HER2 status) show that we identify with high confidence lncRNA differentially expressed according to pathological breast cancer subtypes.

On each plot (Fig. 1c, d), we highlight the lncRNAs with the highest absolute fold changes in each breast cancer subgroup. Detailed results from the differential expression analysis are available in Supplementary Data 1. FOXCUT was the most significantly deregulated lncRNA over-expressed in ER negative tumors with the highest fold change in both SCAN-B and TCGA, it has been previously shown to enhance proliferation and migration in ER negative breast cancer cell lines19.

We further performed all pairwise differential expression analyses within the five molecular PAM50 subtypes, Luminal A, Luminal B, Her2-enriched, Basal-like and Normal-like. Figure 1e shows the results of such analysis for Luminal A versus Luminal B, two subtypes considered to be closely related, as they are both ER positive; however, we still report 1448 differentially expressed lncRNA between these two subtypes. All pairwise comparisons considering PAM50 subtypes are presented in Supplementary Fig. 3 and Supplementary Data 1.

Few lncRNAs have been associated to patient outcome20. To assess the relevance of lncRNA expression robustly and systematically with regards to breast cancer prognosis, we performed Cox proportional hazards regression analysis in the SCAN-B cohort in ER + and ER patients separately. 305 lncRNAs were significantly associated to overall survival of ER + patients in the SCAN-B cohort, of which MAPT-AS1, AP000851.1, AP000851.2, and ROCR could be validated in TCGA-BRCA (Supplementary Data 2). MAPT-AS1 has been previously shown to be associated with better patient outcome in breast cancer patients21. ROCR, the lncRNA with highest expression in the Luminal A subtype was also associated with ER + prognosis. 77 lncRNAs were associated to overall survival within the ER- group in the SCAN-B cohort, however, none of these were significantly associated with survival after multiple testing correction in the TCGA-BRCA cohort.

To our knowledge, this initial analysis is the first to robustly identify differentially expressed lncRNAs according to breast cancer clinicopathological features and molecular subtypes in two large and independent cohorts.

Clustering lncRNAs according to high degree of co-expression with protein coding mRNAs

To associate lncRNA expression to known biological functions, we used a co-expression approach (Supplementary Fig. 4a) between lncRNA (n = 4108) and protein coding genes´ mRNA (n = 17060). Retaining the significant Spearman correlation coefficients of all lncRNA-mRNA associations in both cohorts (Bonferroni corrected p-value < 0.05), led to n = 15407856 significant correlations. On average, each lncRNA was significantly correlated with the expression of 95 mRNAs (Supplementary Fig. 4b), while each mRNA was on average correlated with 20 lncRNAs (Supplementary Fig. 4c). Among the lncRNAs associated to the expression of the highest number of mRNAs, we found a non-coding RNA activated by DNA damage (NORAD), known to regulate genome stability22, as well as other lncRNAs with known function in DNA-damage response, including the estrogen responsive LINC0148823.

We then performed unsupervised clustering of the significant correlations with an absolute Spearman Rho >0.4 and involving lncRNAs and mRNAs with more than the average number of significant correlations (Supplementary Fig. 4b, c). All significant correlations fulfilling these criteria are denoted in Supplementary Data 3. We identified three lncRNA clusters (x-axis) which correlated with three mRNA clusters (y-axis) (Fig. 2a). Interestingly, most of the correlation coefficients (99.8%) were positive, showing more positive correlations between mRNA and lncRNA than expected by chance. To assess whether the discovery of our three biclusters was driven by the filtering criteria used to select lncRNA and mRNA, an unsupervised clustering including all lncRNAs and mRNAs allowed the rediscovery of the three biclusters, however with much more sparsity (Supplementary Fig. 5).

Fig. 2: Clustering of lncRNA into relevant pathways for breast cancer.
figure 2

a Hierarchical clustering of lncRNA-mRNA Spearman correlation values (positive correlation in red, negative correlation in blue) following co-expression analysis between lncRNAs (n = 4108) and protein coding mRNAs (n = 17060). Only lncRNA and mRNA with significant correlation (Bonferroni p-value < 0.05) and −0.4> Spearman’s rho > 0.4 in the TCGA (n = 1095) and SCAN-B (n = 3455) cohorts are used in the unsupervised clustering. In addition, we plot only lncRNAs and mRNAs with number of association higher than the mean value of association (Supplementary Fig. 4). Clusters are defined using cutree_rows = 3 and cutree_cols = 3. lncRNAs (x-axis) are annotated according to the differential expression analysis (Fig. 1). b, d Bar plot showing -log(FDR q.value) from a hypergeometric test (y-axis) of gene set enrichment analysis using Hallmark pathways of the MSigDB database. Input genes for GSEA are genes from mRNA-cluster A (n = 2890) (b), mRNA-cluster B (n = 1480) (c), and mRNA-cluster C (n = 667)(d). Boxplot of the coefficients from the generalized linear modeling of the expression of lncRNAs in the SCAN-B cohort using three variables into the same model, ESR1 mRNA (to reflect estrogen signaling (e)), fibroblast score (to infer fibroblast tumor content (f)) and lymphocyte score (to infer lymphocyte infiltration (g)). Each dot represents the coefficient for a variable and each lncRNA in cluster 1 (n = 610), cluster 2 (n = 199), and cluster 3 (n = 110). Kruskal-Wallis test p-values are shown. The line within each box represents the median. Upper and lower edges of each box represent 75th and 25th percentile, respectively. The whiskers represent the lowest datum still within [1.5 × (75th  −  25th percentile)] of the lower quartile, and the highest datum still within [1.5 × (75th  −  25th percentile)] of the upper quartile.

To link the clustered lncRNAs to the differential expression analysis performed according to breast cancer subtypes, we annotated lncRNAs according to whether they were found overexpressed in the respective groups compared in Fig. 1c–e (column annotations of the heatmap). We observed that lncRNA-cluster 1 and 3 were populated by lncRNAs overexpressed in ER positive and ER negative cases, respectively.

Grouping lncRNAs into pathways related to breast cancer pathogenesis

Following the unsupervised clustering (Fig. 2a), we found a high degree of significant and dominantly positive correlations between the (i) lncRNAs in cluster 1 and the mRNAs in cluster A, (ii) lncRNA-cluster 2 and mRNA-cluster B and (iii) lncRNA-cluster 3 and mRNA-cluster C. By performing Gene Set Enrichment Analysis (GSEA) using the genes of each mRNA-cluster as input, we could infer by association the pathways the lncRNA-clusters may functionally be associated with.

lncRNA-cluster 1 & mRNA-cluster A – Estrogen signaling cluster

91% of the lncRNAs in cluster 1 were significantly overexpressed in ER positive cases, when compared to ER negative, associating these lncRNAs with estrogen signaling. Further, we found that GSEA analysis of genes in mRNA-cluster A were significantly associated with the estrogen signaling pathway (Fig. 2b, Supplementary Data 3).

lncRNA cluster 2 & mRNA-cluster B – Fibroblast cluster

52% of the lncRNAs in cluster 2 were significantly overexpressed in ER positive and 21% in ER negative cases. Interestingly, 87% were overexpressed in Luminal A tumors when compared to Luminal B.

According to GSEA, genes of mRNA-cluster B are involved in Epithelial to Mesenchymal Transition (EMT) and Apical junctions (Fig. 2c, Supplementary Data 3). There is a high similarity between mesenchymal cells and fibroblasts24, and fibroblasts are strongly associated with shaping of the extra cellular matrix25. In addition, fibroblasts have been shown to be highly abundant in Luminal A breast tumors26. We therefore hypothesized that lncRNAs from cluster 2 may be expressed by fibroblasts of the tumor microenvironment.

lncRNA cluster 3 & mRNA-cluster C – Immune cluster

For the third lncRNA-cluster, 61% of the lncRNA were overexpressed in ER negative tumors. Protein coding genes of mRNA cluster C were highly correlated with lncRNA-cluster 3 and enriched among pathways reflecting activation of the immune system (Fig. 2d, Supplementary Data 3). Given the fact that ER negative tumors have significantly higher immune infiltration than ER positive tumors27, we hypothesized that lncRNAs from cluster 3 may be expressed by immune cells and / or modulate the immune tumor microenvironment.

Predicting cell type expression of lncRNAs

Having set hypotheses on which pathways and cell types clustered-lncRNA may be associated with, we aimed at providing further evidence for the cell type specific expression of lncRNAs using different approaches.

First, we modeled lncRNA expression as a multivariate function of ESR1 mRNA expression, fibroblast and lymphocyte infiltration scores reflecting fibroblast or lymphocyte tumor content. We tested which of the three variables explained best each lncRNA’s expression (Supplementary Data 4). To infer the lymphocyte and fibroblast content and calculate lymphocyte and fibroblast scores, we used bulk gene expression and the Nanodissect [23] or xCell28 algorithms, respectively (see Methods).

We found that ESR1 expression, fibroblast score, and lymphocyte score were the most significant explanatory variables for 82% of lncRNAs in cluster 1, 60% of lncRNAs in cluster 2 and 84.5% of lncRNA in cluster 3, respectively. Furthermore, when comparing the logistic regression coefficients, which reflect how much each variable explains lncRNA expression, we found that in average the ESR1-coefficients were significantly higher in cluster 1 (Fig. 2e, SCAN-B and Supplementary Fig. 6a, TCGA), the fibroblast-coefficients significantly higher in cluster 2 (Fig. 2f, SCAN-B and Supplementary 6b, TCGA) and the lymphocyte coefficient significantly higher in cluster 3 (Fig. 2g, SCAN-B and Supplementary Fig. 6c, TCGA).

These detailed analyses clearly divide lncRNA expressed in breast cancer in three categories, they are either expressed by cancer cells and belong to the estrogen signaling pathways or they are expressed by the main cell types of the tumor microenvironment: lymphocytes and fibroblasts.

Expression of lncRNAs in breast cancer cell lines and single cell RNA-seq data

To clearly associate lncRNAs with cell type specific expression in breast cancer, we investigated lncRNA expression in a panel of breast cancer cell lines29. Differential expression analysis of breast cancer cell lines according to molecular subtypes confirmed that the lncRNAs with significantly high expression in Luminal A and Luminal B (ER + ) cell lines belonged to cluster 1. The top three lncRNAs for both Luminal subtypes are shown in Supplementary Fig. 7a, b. Overall, cluster 1 lncRNAs were expressed at higher levels in the Luminal cell lines (Supplementary Fig. 7c, d). Cluster 2 lncRNAs, which we identify as mainly being expressed in fibroblasts of the tumor microenvironment, showed highest expression in cell lines of the Normal-like subtype. In cluster 3, 20% of the lncRNAs were not expressed in any breast cancer cell lines, but the remaining cluster 3 lncRNAs had the highest expression in Basal, Claudin-low, and Normal-like, ER- cell lines (Supplementary Fig. 7e–h). All the lncRNAs that significantly defined each subtype cell lines from the rest are included in Supplementary Data 5.

We next interrogated single cell RNA sequencing data from a study of 26 breast cancer patients30. Following dimensionality reduction and clustering of the 94357 cells from the study by Wu et al., we observed that the cluster of cells obtained overlapped perfectly with the cell type annotation published by the authors30, which included nine main cell types: normal epithelial, cancer epithelial, myeloid, T-cells, B-cells, endothelial cells, plasmablasts, Cancer Associated Fibroblasts (CAFs) and perivascular-like (PVL)-fibroblasts (Fig. 3a).

Fig. 3: lncRNA expression in single cell RNA-seq data.
figure 3

a UMAP of 94357 single cells from breast tumours colour-coded according to cell types. b, d Dot plot of lncRNAs (found in the scRNA-seq data set) with highest glm coefficient associated with the characteristics of each cluster, i.e ESR1 mRNA (Cluster 1), fibroblast score (Cluster 2), lymphocyte score (Cluster 3). Size of the dot represents the percentage of cells expressing the lncRNA, while the colour of the dot reflects the average expression in each of the UMAP-cell-type-cluster identified. Cluster 1 lncRNAs b, cluster 2 lncRNAs c, and cluster 3 lncRNAs d. eg Expression of one high ranking lncRNA from each lncRNA cluster plotted on the scRNA-seq UMAP. Cluster 1-lncRNA: GATA3-AS1 c, cluster 2-lncRNA: NR2F1-AS1 d, and cluster 3-lncRNA: LINC0861 e. Colour gradient (purple) represents Log normalized counts using scale.factor = 10000.

Dot plot analysis which reflects both average expression and percentage of cells expressing lncRNAs was performed for the lncRNAs with the highest logistic regression coefficient associated with each cluster characteristic feature (i.e, ESR1 expression for cluster 1, fibroblast infiltration for cluster 2, and immune infiltration for cluster 3) (Fig. 3b, d). We confirmed that lncRNAs of cluster 1 were expressed at higher levels in cancer epithelial cells, cluster 2-lncRNAs were mainly expressed by cancer associated fibroblasts, while lncRNAs of cluster 3 were expressed by immune cells. We further illustrate the expression of GATA3-AS1 (Fig. 3e), NR2F1-AS1 (Fig. 3f) and LINC00861 (Fig. 3g) on a Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP). LINC00861 has been shown to be expressed in T-Cells in the tissue microenvironment (TME) of lung adenocarcinoma patients and was associated with better prognosis31. This lncRNA was also associated with better outcome in ER- patients in the SCAN-B cohort (Supplementary Data 2). Additional illustrations of lncRNA expression on the UMAP are included as Supplementary Fig. 8.

With these analyses, we directly identified lncRNA expression in either breast cancer cells, including cell lines, immune cells, or fibroblast of the breast tumor microenvironment.

Transcriptional regulation of expression at lncRNA promoters

lncRNAs are typically co-expressed with protein coding mRNA neighboring genes4. We aimed at characterizing lncRNAs regulatory regions in breast cancer.

To focus only on lncRNA specific regulatory regions and avoid analyzing regulatory regions from protein coding genes, we selected lncRNAs for which the promoter regions (transcription start site (TSS) −200/+100 bp) did not overlap with protein coding genes (Fig. 4a, Supplementary Data 6). Indeed, lncRNAs with promoters overlapping with protein coding genes had a higher level of co-expression with neighboring protein coding genes than independent lncRNAs and the nearest protein coding mRNA (Supplementary Fig. 9). We therefore further analyzed the promoters of lncRNA with no overlap with protein coding gene loci; either promoters of lncRNAs overexpressed in ER positive (n = 2320) or ER negative (n = 536) samples. We compared these two groups of promoters with respect to i) Chromatin accessibility measured by ATAC-seq in 74 TCGA-BRCA patients, ii) ChromHMM, chromatin genome segmentation, and iii) Transcription Factor (TF) – binding sites using the UniBind database32.

Fig. 4: Functional annotation of lncRNA promoters.
figure 4

a Schematic overview of the definition of lncRNA promoters not overlapping with a protein coding gene locus. bp: base pair; PC: protein-coding; TSS: transcription start site. b, c Average normalized counts for ATAC-seq peaks mapped to lncRNA promoters in estrogen receptor (ER) positive (+) (blue dots) (n = 58) and ER negative (−) (red dots) (n = 12) breast tumor samples from the TCGA-BRCA cohort. Wilcoxon test p-values are denoted. The line within each box represents the median. Upper and lower edges of each box represent 75th and 25th percentile, respectively. The whiskers represent the lowest datum still within [1.5 × (75th −  25th percentile)] of the lower quartile, and the highest datum still within [1.5 × (75th  −  25th percentile)] of the upper quartile. b Promoters of independent lncRNAs overexpressed in ER positive cases and c promoters of independent lncRNAs overexpressed in ER negative cases. d, e Enrichment of independent lncRNA promoters across ChromHMM genome segmentation from breast cancer cell lines. Enrichment is calculated as the ratio between the frequency of lncRNA promoters found within a specific segment type, over the frequency of all lncRNA promoters within the same segment type. The length of the bars (x-axis) shows the log transformed BH corrected p-value from the hypergeometric test. d Promoters of independent lncRNAs overexpressed in ER positive cases and e promoters of independent lncRNAs overexpressed in ER negative cases. Active Enhancer=EhAct, Active Promoter = PrAct, Repeat Zink Finger = RpZNF, Flanking Promoter region = PrFlk. f, g Swarm plots showing enrichment of TF binding sites (–(log10(p-value) using Fisher’s exact tests) on the y-axis for specific sets of promoters according to UniBind. TF names of the top 10 enriched TF binding sites data sets are annotated by colours. f Promoters of independent lncRNA overexpressed in ER positive cases and g promoters of independent lncRNAs overexpressed in ER negative cases.

lncRNA promoters are accessible in an ER-status specific manner

We found lncRNA promoters to be accessible in a lineage specific manner, i.e. promoters of lncRNA overexpressed in ER positive tumors were more open (higher Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) signal) in ER positive samples than in ER negative samples. Similarly, promoters of lncRNAs over-expressed in ER negative tumors showed significantly higher ATAC-seq signal in ER negative samples (Fig. 4b, c, Supplementary Data 6), suggesting that lncRNA promoters are highly regulated in a subtype specific manner.

lncRNA promoters are defined as active regions according to chromHMM

We assessed whether lncRNA promoters were enriched for specific chromHMM regions defined in subtype specific breast cancer cell lines. We mainly observed significant enrichment for ‘Promoter Flanking’ and ‘Enhancer’ (Fig. 4d, e, Supplementary Data 6). When expanding the window upstream of the TSS, the enrichment for ‘Enhancer’ marks became even more significant, with the lncRNAs over-expressed in ER negative tumors showing particularly significant overlap with ‘Enhancer’ marks in Basal like cell lines (Supplementary Fig. 10).

Specific transcription factors binding sites are found at lncRNA promoters

We next sought for enrichment in transcription factor binding sites (TFBS) in lncRNAs promoters using the UniBind database32. UniBind stores TFBSs with both experimental and computational evidence for direct TF-DNA interactions. We found ER + lncRNA promoters enriched for FOXA1 and ESR1 binding sites; TFs known to drive ER positive breast cancer (Fig. 4f, Supplementary Data 6). On the other hand, promoters of the lncRNAs highly expressed in ER negative tumors were enriched for BATF3, MAF, and RELA, components of the NF-κB TF complex, shown to be constitutively active in triple negative breast cancer33 (Fig. 4g, Supplementary Data 6).

To further assess the specificity of the TF binding according to length of the promoter chosen for the lncRNA, we assess three different sizes of promoters: TSS −300/+100 bp, TSS −500/+100 bp, TSS −1000/+100 bp. For ER + lncRNAs binding of ESR1 and FOXA1 dominated for all window sizes (Supplementary Fig. 11a–c). When extending the window upstream of the TSS for ER- lncRNA there was also enrichment for CEBPB, a transcription factor involved in inflammatory response34, and several additional AP-1 family members with known function in dendritic cell identity35 (Supplementary Fig. 11d–f).

Altogether, these results gave insight into the regulatory programs specifically at lncRNA promoters and showed that this regulation is closely related to estrogen receptor status in breast cancer.

Identifying distal regulatory regions for lncRNA

Finally, we sought for distal regulatory regions for lncRNA in breast cancer. We used our previously published method36, which is efficient at identifying distal enhancer and long-range interactions between enhancers and promoters through negative correlations between DNA methylation and transcript expression. We correlated the levels of DNA methylation at CpGs and lncRNA expression for all CpGs and lncRNAs on the same chromosome in two cohorts for which DNA methylation and lncRNA expression were available TCGA-BRCA (n = 603) and OSLO2 (n = 279). As the OSLO2 lncRNA expression was measured by Agilent microarray 60 K, we focused on 1027 lncRNAs found in both cohorts (Supplementary Fig. 12). For both cohorts, we identified 26342 CpGs significantly inversely correlated with 396 lncRNA (Bonferroni corrected Spearman correlation p-value < 0.05). We first tested in which chromHMM regions the CpGs whose DNA methylation was inversely correlated with lncRNAs were located and found them significantly enriched in enhancer regions (Fig. 5a, Supplementary Data 7). CpGs negatively correlated with lncRNAs highly expressed in ER positive tumors were found in open chromatin regions significantly more open in ER positive samples according to the TCGA-BRCA ATAC-seq data (Fig. 5b). Correspondingly, CpGs negatively correlated with lncRNAs highly expressed in ER negative breast cancer were found in regions significantly more open in ER negative tumors (Fig. 5c). Further confirming that the CpG in cis inverse correlation with lncRNA expression pointed at biologically relevant and active distal regulatory regions, we found such CpGs near binding sites of TFs described at breast cancer enhancers (Fig. 5d).

Fig. 5: Distal regulatory element in the LINC01488 locus.
figure 5

a Enrichment of CpGs with DNA methylation significantly inversely correlated with lncRNA expression across ChromHMM genome segmentation from breast cancer cell lines. Enrichment is calculated by comparing the genomic location of the CpG inversely correlated to all the CpGs on the 450k Illumina array as background. Active Enhancer = EhAct, Ehnancer Genic = EhGen, Transcription flanking = TxFlk. Average normalized counts for ATAC-seq peaks mapped to CpG location for which DNA methylation is significantly inversely correlated with lncRNAs with higher expression in ER positive cases (b) and higher expression in ER negative cases (c). ATAC-seq data from ER + (blue dots) (n = 58) and ER- (red dots) (n = 12) breast tumor samples from the TCGA-BRCA cohort. Wilcoxon test p-values are denoted. The line within each box represents the median. Upper and lower edges of each box represent 75th and 25th percentile, respectively. The whiskers represent the lowest datum still within [1.5 × (75th  −  25th percentile)] of the lower quartile, and the highest datum still within [1.5 × (75th  −  25th percentile)] of the upper quartile. d Swarm plot showing enrichment of TF binding sites (–(log10(p-value) using Fisher’s exact tests) on the y-axis for CpGs with DNA methylation inversely correlated with lncRNA expression. Names of the top 10 enriched TF binding sites data sets are annotated by colours. e Graphical illustration of the LINC01488 locus annotated for different epigenomic tracks. CpGs measured by the 450 k Illumina array are shown together with the significant negative correlations between levels of DNA methylation and LINC01488 expression in the OSLO2 and TCGA cohorts (blue arcs, negative expression-methylation correlation). ChromHMM Enhancer regions (active and genic) in the Mcf7 cell line (green) with ChiA-PET polII loop connecting the TSS of LINC01488 to the CpG in the enhancer region (pink arcs). TF binding of ESR1 (dark blue), FOXA1 (blue), and GATA3 (light blue) from ChIP-seq experiments (ReMap). f, g Correlation plot of levels of LINC01488 expression (x-axis) and levels of DNA methylation of the CpG (y-axis) in long-range interaction in e. Rho and p-value from Spearman correlation is indicated. f OSLO2 (ER positive, n = 214, ER negative, n = 52), g TCGA (ER positive, n = 807, ER negative, n = 237. h Graphical illustration of the LINC01488 locus annotated with ChromHMM Enhancer regions (active and genic) in the Mcf7 cell line (green) and ChiA-PET polII loop connecting LINC01488 to CCND1 (pink arcs). i Correlation plot of log2(TPM + 1) LINC01488 expression (x-axis) and log2(TPM + 1) CCND1 expression (y-axis) in ER positive (n = 2409) and ER negative (n = 504) patients in the SCAN-B cohort. Rho and p-value from Spearman correlation are indicated.

The LINC01488 locus provides a good illustration of distal regulatory regions possibly involved in the regulation of lncRNA expression (Fig. 5e). LINC01488 expression showed negative correlation to distant CpGs on the same chromosome in the TCGA-BRCA and OSLO2 cohorts (Fig. 5e). A specific negative correlation between LINC01488 expression and DNA methylation levels at a CpG (cg00211115) in an upstream active enhancer region is shown in Fig. 5f (OSLO2) and Fig. 5g (TCGA). This CpG has lower levels of methylation in ER positive patients and was found to reside within the binding sites of key transcription factors (ESR1, FOXA1, and GATA3, ChIP-seq). Furthermore, experimental long-range interactions defined by Pol2 binding (ChIA-PET Pol2 data), showed an interaction, loop, between the distal enhancer and LINC01488 TSS (Fig. 5e). LINC01488 was also detected in a long-range interaction with CCND1 (Fig. 5h) and showed significant correlation to CCND1 expression in both SCAN-B (Fig. 5i) and the TCGA-BRCA cohort (Supplementary Fig. 13). Other examples of lncRNAs with inverse correlation with DNA CpG methylation at enhancer sites that reside in long-range interactions are shown in Supplementary Figs. 13 and 14 and Supplementary Data 7, or lncRNAs in long-range interactions with protein coding mRNAs (Supplementary Data 7).

Altogether, these analyses show that integration of lncRNA expression with DNA methylation and long-range interaction data aids in identifying subtype-specific distal regulatory regions for lncRNA.

Read more here: Source link