CeTF: an R/Bioconductor package for transcription factor co-expression networks using regulatory impact factors (RIF) and partial correlation and information (PCIT) analysis | BMC Genomics

CeTF is a C/C++ implementation in R of the PCIT [6] and RIF [7] algorithms, which were originally written in FORTRAN. Integrating the two algorithms in a single package was intended to improve performance and results. Input data may come from microarray, RNA-seq, or single-cell RNA-seq experiments and can be read counts or expression values (TPM, FPKM, normalized values, etc.). The main pipeline (Fig. 1) consists of the following steps.

Fig. 1
CeTF workflow. From top to bottom, the four main steps start with data adjustment, followed by differential expression analysis, Regulatory Impact Factors (RIF) analysis, and Partial Correlation and Information Theory (PCIT) analysis. The plots show examples of visualizations that the package can generate (e.g., data distribution, smear plot, network, heatmap, circos plot).

Data adjustment

If the input is a count table, each column (x) is converted to TPM as follows:

$$ \begin{aligned} TPM = \frac{10^{6}x}{sum(x)} \end{aligned} $$

(1)

The mean of the nonzero TPM values and the mean value of each gene are used as thresholds to filter the genes. Genes with values above half of these averages are kept for subsequent analyses. The TPM data are then normalized using:

$$ \begin{aligned} Norm = \frac{\log(x + 1)}{\log(2)} \end{aligned} $$

(2)

If the input already contains normalized expression data (TPM, FPKM, etc.), the only step applied is the same gene filter based on half of the means.
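As a minimal sketch of Eqs. 1 and 2 only, the column-wise scaling and the log transformation could be written as follows in Python with NumPy. The function names and the toy count table are assumptions made for illustration and are not part of the CeTF API; the gene filter is omitted because the text does not specify it in enough detail to reproduce. Note that Eq. 1 as written scales each column to one million and has no gene-length term.

```python
import numpy as np

def counts_to_tpm(counts):
    """Scale each column (sample) so that it sums to one million (Eq. 1)."""
    counts = np.asarray(counts, dtype=float)
    return 1e6 * counts / counts.sum(axis=0, keepdims=True)

def log2_normalize(tpm):
    """Apply log(x + 1) / log(2), which equals log2(x + 1) (Eq. 2)."""
    return np.log(np.asarray(tpm, dtype=float) + 1.0) / np.log(2.0)

# Hypothetical count table: 3 genes (rows) x 2 samples (columns).
counts = np.array([[10, 0],
                   [30, 50],
                   [60, 50]])
tpm = counts_to_tpm(counts)
norm = log2_normalize(tpm)
```

Adding 1 before taking the logarithm keeps zero counts at zero after normalization instead of producing negative infinity.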

Differential expression analysis

Two options are available for differential expression analysis: the Reverter method [8] and DESeq2 [9]. Both methods require two conditions (e.g., control vs. tumor samples). In the Reverter method, the mean expression of each gene is calculated across the samples of each condition, and the mean of one condition is subtracted from the mean of the other. The variance of these differences is then computed, and the difference of expression is calculated using the following formula, where s is the result of the subtraction and var is its variance:

$$ \begin{aligned} diff = \frac{s - \frac{sum(s)}{length(s)}}{\sqrt{var}} \end{aligned} $$

(3)

The DESeq2 method performs differential expression analysis based on the negative binomial distribution. Although both methods can be used on count data, it is strongly recommended to use only the Reverter method on expression input data.
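The Reverter-style score in Eq. 3 amounts to standardizing the per-gene difference of condition means. The sketch below illustrates this under stated assumptions: the function name, the direction of the subtraction (first condition minus second), the use of the sample variance (ddof=1), and the toy matrices are all choices made here for illustration, not details confirmed by the text.

```python
import numpy as np

def reverter_diff(expr_a, expr_b):
    """Standardized difference of per-gene condition means (Eq. 3).

    expr_a, expr_b: genes x samples arrays, one per condition.
    """
    mean_a = np.asarray(expr_a, dtype=float).mean(axis=1)
    mean_b = np.asarray(expr_b, dtype=float).mean(axis=1)
    s = mean_a - mean_b        # per-gene difference of means (direction assumed)
    var = s.var(ddof=1)        # assumption: sample variance of s across genes
    return (s - s.sum() / s.size) / np.sqrt(var)

# Hypothetical normalized expression: 4 genes x 2 samples per condition.
control = np.array([[5.0, 5.2], [3.0, 3.4], [8.0, 7.6], [1.0, 1.2]])
tumor = np.array([[5.1, 5.3], [6.0, 6.2], [7.9, 7.7], [1.1, 1.1]])
diff = reverter_diff(control, tumor)
```

In this toy example the second gene changes most between conditions, so it receives the largest absolute score.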

Regulatory impact factors (RIF) analysis

The RIF algorithm is described in detail in the original paper [7]. This step aims to identify critical Transcription Factors (TFs) by calculating, for each condition, the co-expression correlation between the TFs and the Differentially Expressed (DE) genes from the previous step. The result is two metrics, RIF1 and RIF2, that allow the identification of critical TFs. RIF1 ranks highest the TFs that are most differentially co-expressed with the highly abundant and highly DE genes, and RIF2 ranks highest the TFs with the most altered ability to act as predictors of the abundance of DE genes. A TF is considered a main TF if:

$$ \begin{aligned} \sqrt{RIF1^{2}} \quad \text{or} \quad \sqrt{RIF2^{2}} > 1.96 \end{aligned} $$

(4)
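Because the square root of a squared value is its absolute value, the criterion in Eq. 4 selects TFs whose RIF1 or RIF2 score exceeds 1.96 in magnitude. A minimal sketch of that selection follows; the function name, TF labels, and scores are hypothetical, and the RIF scores are assumed to have been computed already, since the RIF formulas themselves are not given here.

```python
import numpy as np

def main_tfs(tf_names, rif1, rif2, threshold=1.96):
    """Return the TFs whose |RIF1| or |RIF2| exceeds the threshold (Eq. 4)."""
    rif1 = np.asarray(rif1, dtype=float)
    rif2 = np.asarray(rif2, dtype=float)
    # sqrt(RIF^2) is the absolute value of the score.
    keep = (np.sqrt(rif1 ** 2) > threshold) | (np.sqrt(rif2 ** 2) > threshold)
    return [name for name, flag in zip(tf_names, keep) if flag]

# Hypothetical scores for three TFs.
selected = main_tfs(["TF_A", "TF_B", "TF_C"],
                    rif1=[2.5, 0.3, -1.0],
                    rif2=[0.1, -0.4, -2.2])
```

Here TF_A passes on RIF1 and TF_C passes on RIF2, while TF_B passes on neither.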

Partial correlation and information theory (PCIT) analysis

The PCIT algorithm is also described in detail in the original paper by Reverter and Chan [6] and has been used for the reconstruction of Gene Co-expression Networks (GCN). PCIT combines the concept of the partial correlation coefficient with information theory to identify significant gene-to-gene associations, which define the edges of the reconstructed network. At this stage, the pairwise correlations among three genes are evaluated simultaneously to infer which genes are co-expressed. This approach is more sensitive than other methods and allows the detection of functionally validated gene-gene interactions. First, the partial correlation coefficients are calculated for every trio of genes x, y, and z:

$$ \begin{aligned} r_{xy,z} = \frac{r_{xy} - r_{xz}r_{yz}}{\sqrt{(1 - r^{2}_{xz})(1 - r^{2}_{yz})}} \end{aligned} $$

(5)

and similarly for r_{xz,y} and r_{yz,x}. Next, the tolerance level (ε) is calculated for each trio of genes and used as a threshold for capturing significant associations. It is the average ratio of partial to direct correlation, computed as follows:

$$ \begin{aligned} \varepsilon = \frac{1}{3} \left(\frac{r_{xy,z}}{r_{xy}} + \frac{r_{xz,y}}{r_{xz}} + \frac{r_{yz,x}}{r_{yz}} \right) \end{aligned} $$

(6)

The association between the genes x and y is discarded if:

$$ \begin{aligned} |r_{xy}| \leq |\varepsilon r_{xz}| \quad \text{and} \quad |r_{xy}| \leq |\varepsilon r_{yz}| \end{aligned} $$

(7)

Otherwise, the association is considered significant, and the interaction between genes x and y is used in the reconstruction of the GCN. The final output includes the network with gene-gene and gene-TF interactions for both conditions, as well as the main TFs identified in the network.
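The trio test described above can be sketched as follows. This is an illustrative Python reading of Eqs. 5 to 7, not the package's C/C++ code: the function names and the toy correlation matrix are assumptions, and the rule is applied in the form "discard a pair when its direct correlation is no larger in magnitude than ε times each of the other two correlations in the trio". Correlations of exactly 0 or ±1 would cause a division by zero and are not handled.

```python
import numpy as np

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation r_{xy,z} (Eq. 5)."""
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def pcit_significant(corr):
    """Return a boolean matrix marking the gene pairs kept after the trio test."""
    corr = np.asarray(corr, dtype=float)
    n = corr.shape[0]
    keep = ~np.eye(n, dtype=bool)  # start with every off-diagonal pair kept
    for x in range(n):
        for y in range(x + 1, n):
            for z in range(y + 1, n):
                r_xy, r_xz, r_yz = corr[x, y], corr[x, z], corr[y, z]
                # Tolerance level: mean ratio of partial to direct correlation (Eq. 6).
                eps = (partial_corr(r_xy, r_xz, r_yz) / r_xy
                       + partial_corr(r_xz, r_xy, r_yz) / r_xz
                       + partial_corr(r_yz, r_xy, r_xz) / r_yz) / 3.0
                # Apply the discard rule (Eq. 7) to each pair of the trio in turn,
                # always using the original correlations.
                if abs(r_xy) <= abs(eps * r_xz) and abs(r_xy) <= abs(eps * r_yz):
                    keep[x, y] = keep[y, x] = False
                if abs(r_xz) <= abs(eps * r_xy) and abs(r_xz) <= abs(eps * r_yz):
                    keep[x, z] = keep[z, x] = False
                if abs(r_yz) <= abs(eps * r_xy) and abs(r_yz) <= abs(eps * r_xz):
                    keep[y, z] = keep[z, y] = False
    return keep

# Hypothetical chain x - y - z: the x-z correlation is explained entirely by y.
corr = np.array([[1.00, 0.50, 0.25],
                 [0.50, 1.00, 0.50],
                 [0.25, 0.50, 1.00]])
keep = pcit_significant(corr)
```

In this example the partial correlation between x and z given y is zero, so the x-z edge is discarded while the x-y and y-z edges are retained. The triple loop visits every trio once, so the cost grows cubically with the number of genes, which is one reason a compiled implementation is attractive.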

Functions of the package

There are 28 functions and 5 example datasets available in CeTF, which are described in Table 1. A working example for each of these functions is given in the package documentation in the Supplementary Material. The package allows the integration with many other packages and different types of genomics/transcriptomics analysis.

Table 1 Functions available in CeTF

Additional functionalities

The CeTF package also includes features for visualizing the results. After running the PCIT and RIF analyses, it is possible to plot the data distribution, the distribution of differentially expressed genes/TFs (average expression in log2 against the difference of expression), the network for each condition, and the integrated network with genes, TFs, and enriched pathways. The targets of specific TFs can also be visualized as a circos plot. In addition, ontologies can be grouped [10] without statistical inference, and functional enrichment with statistical inference can be performed against several databases and for many organisms using WebGestalt [11]. Finally, all tables can be saved, including interaction networks, enrichment results, differential expression results, main TFs, and others.

Software construction

CeTF is an R-based toolkit, and most of the code is written in R. The PCIT and tolerance functions were written in C/C++ using Rcpp (v1.0.5) [12] and RcppArmadillo (v0.10.1.2.2) [13] for better performance. The main R packages used for analysis and visualization of the results were circlize (v0.4.10) [14], ComplexHeatmap (v2.6.0) [15], DESeq2 (v1.30.0) [9], ggplot2 (v3.3.2) [16], and RCy3 (v2.10.0) [17]; others are listed in the Supplementary Material.
