Question on the fold change estimate

Dear Michael,

I have a raw count dataset with three conditions (A,B,C) and each condition has 2-3 biological replicates. I calculated the normalized counts (not log2 transformed) in DESeq2, and for instance the following lists the normalized count of a gene for each replicate.

#        A1        A2        A3        B1        B2        B3       C1       C2  
#108.74237   86.54679  109.39501  562.57695  570.28352  481.73985 1537.51446 1571.32569

I tried to estimate the fold change between C and A simply by calculating the ratio. That is, the mean of normalized counts in C divided by the mean of normalized counts in A is 15.31.

I also tried transforming the raw counts through the rlog method (set blind = FALSE), and the following gives the rlog values for the same gene in each replicate:

#        A1        A2        A3        B1        B2        B3       C1       C2 
# 7.956540  7.925518  7.983866  9.034676  9.043481  8.896545 10.052179 10.091817

I estimated the log2 fold change (C vs A) based on the rlog values, that, the mean of rlog values in C minus that in A. The resulting fold change estimate will be 4.34, much less than 15.31 above.

I understand that DESeq2 shrinks the log fold change to take care of genes of low counts, so it’s not simply the ratio of normalized counts. I also calculated the shrunken fold change (that is, 15.24) in DESeq2 and compared that to the one (15.31) I calculated previously, and they were very close.

After looking at the manual and paper, I thought log fold change shrinkage and the rlog method were conceptually consistent since they both tried to minimize the differences among samples of genes with low counts. But I am wondering why the fold change estimate using the rlog values will turn out to be so different. I would appreciate it very much that you would give any comments or correct my thought.

Also, the goal of my analysis to identify expression patterns of genes among the three conditions. I think transformed data from either rlog or vst would be a better choice to be used for the hierarchical clustering of the genes as the dependence of variance on the mean is removed and this might lead to more accurate clustering than using the normalized counts (without variance stabilization). Wondering if my understanding is correct.

Thanks so much for your help in advance!

Read more here: Source link