lncRNA expression according to breast cancer clinicopathological subtypes
To identify lncRNAs expressed by specific breast cancer subtypes or associated with clinicopathological features, we analyzed RNA-sequencing data from two large independent breast cancer cohorts: SCAN-B (n = 3455)17 and TCGA-BRCA (n = 1095).
We focused on lncRNAs annotated in the Ensembl18 v93 non-coding reference transcriptome (Supplementary Fig. 1), and identified 4108 lncRNAs present in both cohorts, which are further analyzed in this study. A small number of lncRNAs (100 in SCAN-B, 37 in TCGA) were expressed >1 transcript per million (TPM) in all patients, but the majority of lncRNAs were expressed at lower levels. The two cohorts differ in the distribution of patients expressing lncRNAs>1TPM (Supplementary Fig. 2). Such sparsity of the lncRNA expression in the two cohorts highlights the importance of analyzing at least two independent breast cancer cohorts to robustly identify the lncRNA associated with clinicopathological features. Hierarchical clustering of the log2 expression values of the 4108 lncRNAs clearly grouped ER positive from ER negative patients, (Fig. 1a: SCAN-B and Fig. 1b: TCGA), indicating an association between breast cancer subtypes and lncRNA expression.
To further identify the lncRNAs associated with breast cancer subtypes, we performed differential expression analysis using linear modeling and empirical Bayes moderation. We report lncRNAs with significant differential expression according to the ER status (Fig. 1c) and HER2 status (Fig. 1d). The significant Pearson correlation between the log fold change (FC) in the SCAN-B and the TCGA cohorts (r = 0.93, p-value < 2.2e-16: ER status and r = 0.75, p-value < 2.2e-16: HER2 status) show that we identify with high confidence lncRNA differentially expressed according to pathological breast cancer subtypes.
On each plot (Fig. 1c, d), we highlight the lncRNAs with the highest absolute fold changes in each breast cancer subgroup. Detailed results from the differential expression analysis are available in Supplementary Data 1. FOXCUT was the most significantly deregulated lncRNA over-expressed in ER negative tumors with the highest fold change in both SCAN-B and TCGA, it has been previously shown to enhance proliferation and migration in ER negative breast cancer cell lines19.
We further performed all pairwise differential expression analyses within the five molecular PAM50 subtypes, Luminal A, Luminal B, Her2-enriched, Basal-like and Normal-like. Figure 1e shows the results of such analysis for Luminal A versus Luminal B, two subtypes considered to be closely related, as they are both ER positive; however, we still report 1448 differentially expressed lncRNA between these two subtypes. All pairwise comparisons considering PAM50 subtypes are presented in Supplementary Fig. 3 and Supplementary Data 1.
Few lncRNAs have been associated to patient outcome20. To assess the relevance of lncRNA expression robustly and systematically with regards to breast cancer prognosis, we performed Cox proportional hazards regression analysis in the SCAN-B cohort in ER + and ER patients separately. 305 lncRNAs were significantly associated to overall survival of ER + patients in the SCAN-B cohort, of which MAPT-AS1, AP000851.1, AP000851.2, and ROCR could be validated in TCGA-BRCA (Supplementary Data 2). MAPT-AS1 has been previously shown to be associated with better patient outcome in breast cancer patients21. ROCR, the lncRNA with highest expression in the Luminal A subtype was also associated with ER + prognosis. 77 lncRNAs were associated to overall survival within the ER- group in the SCAN-B cohort, however, none of these were significantly associated with survival after multiple testing correction in the TCGA-BRCA cohort.
To our knowledge, this initial analysis is the first to robustly identify differentially expressed lncRNAs according to breast cancer clinicopathological features and molecular subtypes in two large and independent cohorts.
Clustering lncRNAs according to high degree of co-expression with protein coding mRNAs
To associate lncRNA expression to known biological functions, we used a co-expression approach (Supplementary Fig. 4a) between lncRNA (n = 4108) and protein coding genes´ mRNA (n = 17060). Retaining the significant Spearman correlation coefficients of all lncRNA-mRNA associations in both cohorts (Bonferroni corrected p-value < 0.05), led to n = 15407856 significant correlations. On average, each lncRNA was significantly correlated with the expression of 95 mRNAs (Supplementary Fig. 4b), while each mRNA was on average correlated with 20 lncRNAs (Supplementary Fig. 4c). Among the lncRNAs associated to the expression of the highest number of mRNAs, we found a non-coding RNA activated by DNA damage (NORAD), known to regulate genome stability22, as well as other lncRNAs with known function in DNA-damage response, including the estrogen responsive LINC0148823.
We then performed unsupervised clustering of the significant correlations with an absolute Spearman Rho >0.4 and involving lncRNAs and mRNAs with more than the average number of significant correlations (Supplementary Fig. 4b, c). All significant correlations fulfilling these criteria are denoted in Supplementary Data 3. We identified three lncRNA clusters (x-axis) which correlated with three mRNA clusters (y-axis) (Fig. 2a). Interestingly, most of the correlation coefficients (99.8%) were positive, showing more positive correlations between mRNA and lncRNA than expected by chance. To assess whether the discovery of our three biclusters was driven by the filtering criteria used to select lncRNA and mRNA, an unsupervised clustering including all lncRNAs and mRNAs allowed the rediscovery of the three biclusters, however with much more sparsity (Supplementary Fig. 5).
To link the clustered lncRNAs to the differential expression analysis performed according to breast cancer subtypes, we annotated lncRNAs according to whether they were found overexpressed in the respective groups compared in Fig. 1c–e (column annotations of the heatmap). We observed that lncRNA-cluster 1 and 3 were populated by lncRNAs overexpressed in ER positive and ER negative cases, respectively.
Grouping lncRNAs into pathways related to breast cancer pathogenesis
Following the unsupervised clustering (Fig. 2a), we found a high degree of significant and dominantly positive correlations between the (i) lncRNAs in cluster 1 and the mRNAs in cluster A, (ii) lncRNA-cluster 2 and mRNA-cluster B and (iii) lncRNA-cluster 3 and mRNA-cluster C. By performing Gene Set Enrichment Analysis (GSEA) using the genes of each mRNA-cluster as input, we could infer by association the pathways the lncRNA-clusters may functionally be associated with.
lncRNA-cluster 1 & mRNA-cluster A – Estrogen signaling cluster
91% of the lncRNAs in cluster 1 were significantly overexpressed in ER positive cases, when compared to ER negative, associating these lncRNAs with estrogen signaling. Further, we found that GSEA analysis of genes in mRNA-cluster A were significantly associated with the estrogen signaling pathway (Fig. 2b, Supplementary Data 3).
lncRNA cluster 2 & mRNA-cluster B – Fibroblast cluster
52% of the lncRNAs in cluster 2 were significantly overexpressed in ER positive and 21% in ER negative cases. Interestingly, 87% were overexpressed in Luminal A tumors when compared to Luminal B.
According to GSEA, genes of mRNA-cluster B are involved in Epithelial to Mesenchymal Transition (EMT) and Apical junctions (Fig. 2c, Supplementary Data 3). There is a high similarity between mesenchymal cells and fibroblasts24, and fibroblasts are strongly associated with shaping of the extra cellular matrix25. In addition, fibroblasts have been shown to be highly abundant in Luminal A breast tumors26. We therefore hypothesized that lncRNAs from cluster 2 may be expressed by fibroblasts of the tumor microenvironment.
lncRNA cluster 3 & mRNA-cluster C – Immune cluster
For the third lncRNA-cluster, 61% of the lncRNA were overexpressed in ER negative tumors. Protein coding genes of mRNA cluster C were highly correlated with lncRNA-cluster 3 and enriched among pathways reflecting activation of the immune system (Fig. 2d, Supplementary Data 3). Given the fact that ER negative tumors have significantly higher immune infiltration than ER positive tumors27, we hypothesized that lncRNAs from cluster 3 may be expressed by immune cells and / or modulate the immune tumor microenvironment.
Predicting cell type expression of lncRNAs
Having set hypotheses on which pathways and cell types clustered-lncRNA may be associated with, we aimed at providing further evidence for the cell type specific expression of lncRNAs using different approaches.
First, we modeled lncRNA expression as a multivariate function of ESR1 mRNA expression, fibroblast and lymphocyte infiltration scores reflecting fibroblast or lymphocyte tumor content. We tested which of the three variables explained best each lncRNA’s expression (Supplementary Data 4). To infer the lymphocyte and fibroblast content and calculate lymphocyte and fibroblast scores, we used bulk gene expression and the Nanodissect [23] or xCell28 algorithms, respectively (see Methods).
We found that ESR1 expression, fibroblast score, and lymphocyte score were the most significant explanatory variables for 82% of lncRNAs in cluster 1, 60% of lncRNAs in cluster 2 and 84.5% of lncRNA in cluster 3, respectively. Furthermore, when comparing the logistic regression coefficients, which reflect how much each variable explains lncRNA expression, we found that in average the ESR1-coefficients were significantly higher in cluster 1 (Fig. 2e, SCAN-B and Supplementary Fig. 6a, TCGA), the fibroblast-coefficients significantly higher in cluster 2 (Fig. 2f, SCAN-B and Supplementary 6b, TCGA) and the lymphocyte coefficient significantly higher in cluster 3 (Fig. 2g, SCAN-B and Supplementary Fig. 6c, TCGA).
These detailed analyses clearly divide lncRNA expressed in breast cancer in three categories, they are either expressed by cancer cells and belong to the estrogen signaling pathways or they are expressed by the main cell types of the tumor microenvironment: lymphocytes and fibroblasts.
Expression of lncRNAs in breast cancer cell lines and single cell RNA-seq data
To clearly associate lncRNAs with cell type specific expression in breast cancer, we investigated lncRNA expression in a panel of breast cancer cell lines29. Differential expression analysis of breast cancer cell lines according to molecular subtypes confirmed that the lncRNAs with significantly high expression in Luminal A and Luminal B (ER + ) cell lines belonged to cluster 1. The top three lncRNAs for both Luminal subtypes are shown in Supplementary Fig. 7a, b. Overall, cluster 1 lncRNAs were expressed at higher levels in the Luminal cell lines (Supplementary Fig. 7c, d). Cluster 2 lncRNAs, which we identify as mainly being expressed in fibroblasts of the tumor microenvironment, showed highest expression in cell lines of the Normal-like subtype. In cluster 3, 20% of the lncRNAs were not expressed in any breast cancer cell lines, but the remaining cluster 3 lncRNAs had the highest expression in Basal, Claudin-low, and Normal-like, ER- cell lines (Supplementary Fig. 7e–h). All the lncRNAs that significantly defined each subtype cell lines from the rest are included in Supplementary Data 5.
We next interrogated single cell RNA sequencing data from a study of 26 breast cancer patients30. Following dimensionality reduction and clustering of the 94357 cells from the study by Wu et al., we observed that the cluster of cells obtained overlapped perfectly with the cell type annotation published by the authors30, which included nine main cell types: normal epithelial, cancer epithelial, myeloid, T-cells, B-cells, endothelial cells, plasmablasts, Cancer Associated Fibroblasts (CAFs) and perivascular-like (PVL)-fibroblasts (Fig. 3a).
Dot plot analysis which reflects both average expression and percentage of cells expressing lncRNAs was performed for the lncRNAs with the highest logistic regression coefficient associated with each cluster characteristic feature (i.e, ESR1 expression for cluster 1, fibroblast infiltration for cluster 2, and immune infiltration for cluster 3) (Fig. 3b, d). We confirmed that lncRNAs of cluster 1 were expressed at higher levels in cancer epithelial cells, cluster 2-lncRNAs were mainly expressed by cancer associated fibroblasts, while lncRNAs of cluster 3 were expressed by immune cells. We further illustrate the expression of GATA3-AS1 (Fig. 3e), NR2F1-AS1 (Fig. 3f) and LINC00861 (Fig. 3g) on a Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP). LINC00861 has been shown to be expressed in T-Cells in the tissue microenvironment (TME) of lung adenocarcinoma patients and was associated with better prognosis31. This lncRNA was also associated with better outcome in ER- patients in the SCAN-B cohort (Supplementary Data 2). Additional illustrations of lncRNA expression on the UMAP are included as Supplementary Fig. 8.
With these analyses, we directly identified lncRNA expression in either breast cancer cells, including cell lines, immune cells, or fibroblast of the breast tumor microenvironment.
Transcriptional regulation of expression at lncRNA promoters
lncRNAs are typically co-expressed with protein coding mRNA neighboring genes4. We aimed at characterizing lncRNAs regulatory regions in breast cancer.
To focus only on lncRNA specific regulatory regions and avoid analyzing regulatory regions from protein coding genes, we selected lncRNAs for which the promoter regions (transcription start site (TSS) −200/+100 bp) did not overlap with protein coding genes (Fig. 4a, Supplementary Data 6). Indeed, lncRNAs with promoters overlapping with protein coding genes had a higher level of co-expression with neighboring protein coding genes than independent lncRNAs and the nearest protein coding mRNA (Supplementary Fig. 9). We therefore further analyzed the promoters of lncRNA with no overlap with protein coding gene loci; either promoters of lncRNAs overexpressed in ER positive (n = 2320) or ER negative (n = 536) samples. We compared these two groups of promoters with respect to i) Chromatin accessibility measured by ATAC-seq in 74 TCGA-BRCA patients, ii) ChromHMM, chromatin genome segmentation, and iii) Transcription Factor (TF) – binding sites using the UniBind database32.
lncRNA promoters are accessible in an ER-status specific manner
We found lncRNA promoters to be accessible in a lineage specific manner, i.e. promoters of lncRNA overexpressed in ER positive tumors were more open (higher Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) signal) in ER positive samples than in ER negative samples. Similarly, promoters of lncRNAs over-expressed in ER negative tumors showed significantly higher ATAC-seq signal in ER negative samples (Fig. 4b, c, Supplementary Data 6), suggesting that lncRNA promoters are highly regulated in a subtype specific manner.
lncRNA promoters are defined as active regions according to chromHMM
We assessed whether lncRNA promoters were enriched for specific chromHMM regions defined in subtype specific breast cancer cell lines. We mainly observed significant enrichment for ‘Promoter Flanking’ and ‘Enhancer’ (Fig. 4d, e, Supplementary Data 6). When expanding the window upstream of the TSS, the enrichment for ‘Enhancer’ marks became even more significant, with the lncRNAs over-expressed in ER negative tumors showing particularly significant overlap with ‘Enhancer’ marks in Basal like cell lines (Supplementary Fig. 10).
Specific transcription factors binding sites are found at lncRNA promoters
We next sought for enrichment in transcription factor binding sites (TFBS) in lncRNAs promoters using the UniBind database32. UniBind stores TFBSs with both experimental and computational evidence for direct TF-DNA interactions. We found ER + lncRNA promoters enriched for FOXA1 and ESR1 binding sites; TFs known to drive ER positive breast cancer (Fig. 4f, Supplementary Data 6). On the other hand, promoters of the lncRNAs highly expressed in ER negative tumors were enriched for BATF3, MAF, and RELA, components of the NF-κB TF complex, shown to be constitutively active in triple negative breast cancer33 (Fig. 4g, Supplementary Data 6).
To further assess the specificity of the TF binding according to length of the promoter chosen for the lncRNA, we assess three different sizes of promoters: TSS −300/+100 bp, TSS −500/+100 bp, TSS −1000/+100 bp. For ER + lncRNAs binding of ESR1 and FOXA1 dominated for all window sizes (Supplementary Fig. 11a–c). When extending the window upstream of the TSS for ER- lncRNA there was also enrichment for CEBPB, a transcription factor involved in inflammatory response34, and several additional AP-1 family members with known function in dendritic cell identity35 (Supplementary Fig. 11d–f).
Altogether, these results gave insight into the regulatory programs specifically at lncRNA promoters and showed that this regulation is closely related to estrogen receptor status in breast cancer.
Identifying distal regulatory regions for lncRNA
Finally, we sought for distal regulatory regions for lncRNA in breast cancer. We used our previously published method36, which is efficient at identifying distal enhancer and long-range interactions between enhancers and promoters through negative correlations between DNA methylation and transcript expression. We correlated the levels of DNA methylation at CpGs and lncRNA expression for all CpGs and lncRNAs on the same chromosome in two cohorts for which DNA methylation and lncRNA expression were available TCGA-BRCA (n = 603) and OSLO2 (n = 279). As the OSLO2 lncRNA expression was measured by Agilent microarray 60 K, we focused on 1027 lncRNAs found in both cohorts (Supplementary Fig. 12). For both cohorts, we identified 26342 CpGs significantly inversely correlated with 396 lncRNA (Bonferroni corrected Spearman correlation p-value < 0.05). We first tested in which chromHMM regions the CpGs whose DNA methylation was inversely correlated with lncRNAs were located and found them significantly enriched in enhancer regions (Fig. 5a, Supplementary Data 7). CpGs negatively correlated with lncRNAs highly expressed in ER positive tumors were found in open chromatin regions significantly more open in ER positive samples according to the TCGA-BRCA ATAC-seq data (Fig. 5b). Correspondingly, CpGs negatively correlated with lncRNAs highly expressed in ER negative breast cancer were found in regions significantly more open in ER negative tumors (Fig. 5c). Further confirming that the CpG in cis inverse correlation with lncRNA expression pointed at biologically relevant and active distal regulatory regions, we found such CpGs near binding sites of TFs described at breast cancer enhancers (Fig. 5d).
The LINC01488 locus provides a good illustration of distal regulatory regions possibly involved in the regulation of lncRNA expression (Fig. 5e). LINC01488 expression showed negative correlation to distant CpGs on the same chromosome in the TCGA-BRCA and OSLO2 cohorts (Fig. 5e). A specific negative correlation between LINC01488 expression and DNA methylation levels at a CpG (cg00211115) in an upstream active enhancer region is shown in Fig. 5f (OSLO2) and Fig. 5g (TCGA). This CpG has lower levels of methylation in ER positive patients and was found to reside within the binding sites of key transcription factors (ESR1, FOXA1, and GATA3, ChIP-seq). Furthermore, experimental long-range interactions defined by Pol2 binding (ChIA-PET Pol2 data), showed an interaction, loop, between the distal enhancer and LINC01488 TSS (Fig. 5e). LINC01488 was also detected in a long-range interaction with CCND1 (Fig. 5h) and showed significant correlation to CCND1 expression in both SCAN-B (Fig. 5i) and the TCGA-BRCA cohort (Supplementary Fig. 13). Other examples of lncRNAs with inverse correlation with DNA CpG methylation at enhancer sites that reside in long-range interactions are shown in Supplementary Figs. 13 and 14 and Supplementary Data 7, or lncRNAs in long-range interactions with protein coding mRNAs (Supplementary Data 7).
Altogether, these analyses show that integration of lncRNA expression with DNA methylation and long-range interaction data aids in identifying subtype-specific distal regulatory regions for lncRNA.
Read more here: Source link