Hi people,
Currently, I’m working on analyzing ATACseq data with the following pipeline:
- fastQC, adapter trimming, etc.
- BWA mem
- Remove duplicates, filter MAPQ
- Call peaks with Genrich for all samples
- Merge all sample peaks into reference peak set
- Count Tn5 cut sites with featureCounts
At this point I have a count matrix, but have questions about normalization.
When normalizing for differential accessibility analysis, should the effective library size be the total number of qualified reads, or only the reads counted in peaks?
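To make the question concrete, here is a minimal toy sketch of how the two library-size choices change CPM values. All numbers are made up, and the names `cpm`, `peak_counts`, `total_reads` and `reads_in_peaks` are hypothetical, not taken from any tool in the pipeline above.

```python
def cpm(counts, lib_sizes):
    """Counts per million; counts[i][j] is the count for peak i in sample j."""
    return [
        [c / n * 1e6 for c, n in zip(row, lib_sizes)]
        for row in counts
    ]

# Toy peak-by-sample matrix (2 peaks, 2 samples); values are invented.
peak_counts = [
    [100, 200],
    [300, 600],
]

# Option A: total qualified reads per sample (assumed equal here).
total_reads = [1_000_000, 1_000_000]

# Option B: reads counted in peaks, i.e. the column sums of the matrix.
reads_in_peaks = [sum(col) for col in zip(*peak_counts)]

cpm_total = cpm(peak_counts, total_reads)
cpm_in_peaks = cpm(peak_counts, reads_in_peaks)

print(cpm_total)     # sample 2 is about 2x sample 1 in every peak
print(cpm_in_peaks)  # the two samples look the same in every peak
```

In this toy case, sample 2 has twice as many reads in peaks at the same sequencing depth. Dividing by total reads keeps that two-fold difference in every peak, while dividing by reads in peaks removes it entirely. So the choice amounts to an assumption about whether a global shift in peak signal is biological or technical (for example, a difference in signal-to-noise between libraries).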
Should data be normalized prior to peak calling, in order to identify peaks that may have been lost to noise?
Many papers report normalizing peak counts with CPM, but many forum posts describe CPM as inadequate for normalizing ATAC data. Some people suggest edgeR's TMM, but I believe edgeR normalizes using only the reads that fall in peaks, which seems biologically incorrect.
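For reference, the idea behind TMM can be sketched as a trimmed mean of per-peak log ratios between a sample and a reference sample. The code below is a deliberately simplified illustration and is not edgeR's implementation: it omits the precision weights and the trimming on average abundance, and the function name `trimmed_mean_factor` and the toy counts are hypothetical.

```python
import math

def trimmed_mean_factor(sample, ref, trim=0.3):
    """Simplified TMM-style scaling factor of `sample` relative to `ref`.

    Assumes 0 <= trim < 0.5 and at least one peak with nonzero counts
    in both samples.
    """
    n_s, n_r = sum(sample), sum(ref)
    # Per-peak log2 ratio of within-sample proportions; skip zero counts.
    log_ratios = sorted(
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, ref)
        if s > 0 and r > 0
    )
    # Drop the most extreme ratios from both ends before averaging.
    k = int(len(log_ratios) * trim)
    kept = log_ratios[k:len(log_ratios) - k] if k > 0 else log_ratios
    return 2 ** (sum(kept) / len(kept))

# Toy data: five peaks, one of which is strongly increased in the sample.
ref = [100, 100, 100, 100, 100]
sample = [100, 100, 100, 100, 600]

factor = trimmed_mean_factor(sample, ref)
effective_lib_size = sum(sample) * factor
print(factor, effective_lib_size)
```

Here the factor comes out at about 0.5, so the effective library size (library size times factor) is about 500, matching the reference, and the four unchanged peaks normalize to the same value in both samples. Note that the library sizes in this sketch are the sums of the peak counts, which is exactly the situation I'm asking about. The underlying assumption is that most peaks do not change between samples, which may not hold for ATAC data with global accessibility shifts.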
These samples have paired RNAseq data, but I'm still waiting on that sequencing data.