What you describe seems to be a difference in signal-to-noise ratio which is not uncommon.
This is where more elaborate normalization methods such as TMM from edgeR
or RLE from DESeq2
come into play.
See the following links on why these methods are superior. The videos talk about RNA-seq but simply consider genes as “regions” in any assay such as ATAC-seq or ChIP-seq:
Also, this is where more simple methods such as TPM/RPKM/FPKM which only correct for total library read count
(=sequencing depth) typically make limited sense as they fail to correct for the systematic differences in library composition due to the differences in either the peak landscape or signal/noise ratio. I cannot comment on Quantile Normalization as I have no experience with it but I had good results with both TMM and RLE. Therefore my recommendation to routinely use something like edgeR
to calculate scaling factors which you then use to divide the column 4 of your bedGraph/bigwig by.
BedGraph/Bigwig creation:
Among other tools this could be done by bamCoverage
from deeptools or genomecov
from bedtools. Both tools start from BAM files and offer plenty of options for customization.
Scaling Factors:
Here is a code suggestion in R
to calculate scaling factors to scale the bedGraph/bigwig files. It assumes you have already created a count matrix with rows = regions and columns = samples, e.g. based on a list of peaks called from the underlying experiment. This could be done with tools such as macs2
or Genrich
followed by featureCounts
. It is assumed that the matrix of raw counts is now loaded in R
as object raw.counts
:
## edgeR:: calcNormFactors
NormFactor <- calcNormFactors(object = raw.counts, method = "TMM")
## if you prefer to use the DESeq2 strategy use method="RLE" instead
## raw library size:
LibSize <- colSums(raw.counts)
## calculate size factors:
SizeFactors <- NormFactor * LibSize / 1000000
## Reciprocal, please read section below:
SizeFactors.Reciprocal <- 1/SizeFactors
SizeFactors
will now contain a factor for every sample which can be used to divide the 4th colun of a bedGraph/bigwig by.
Both the aforementioned tools (bamCoverage
and genomecov
) have options though to directly scale the output by a factor (--scaleFactor
or --scale
respectively).
!! Note though that these options will multiply rather than divide the 4th column with the factor, therefore you would need to provide the reciprocal as mentioned in the code chunk above.
Illustration:
Here is a picture I made a while ago addressing exactly this problem. It shows two biological replicates with different signal-to-noise ratios (replicate two has worse SN therefore overall lower peak counts as many reads fall into non-peak regions) for a region which I know is not differential. Using CPM (so per-million scaling based on only the read counts) shown in the top panel fails to correct for the systematic difference while TMM on the counts in peak regions (bottom panel) manages to correct for it. The middle panel is a bin-based normalization with edgeR suggested in the csaw
package which might in certain situations be beneficial if library composition is very different between samples (check csaw vignette for details if you want), but lets not further discuss this here as it does not add to the topic right now.
Bigwigs in R:
If you want to visualize bigwigs in R one might want to check this handy tool that was recently posted here which seems to be easy to use yet powerful and visually appealing:
Read more here: Source link