I have a quick RNAseq (Quantseq) question for you all!
I am analyzing the Quantseq data for 500 patients and am finding my way through the bioinformatic forest.
Currently I am working on a way to filter out lowly expressed genes, and I am using Bioconductor packages in R to do so. I have thought of a way to do this, but I don't know whether my approach is correct, so comments are greatly appreciated.
I am planning on filtering lowly expressed genes by CPM. My library sizes range from 1.7M to 7.9M. I want to keep genes that have >10 counts, but I want to filter using CPM instead of raw counts, as this also corrects for library size. Can I just use the following formula: "raw counts cutoff" / "minimum libsize in millions" = "CPM threshold"? So this would correspond to a CPM filter value of 10/1.7 = 5.9?
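To make the arithmetic concrete, here is a minimal sketch in plain Python (not the R code I am running; the function name `cpm_threshold` is just something I made up for illustration). It assumes the library size is the total read count of the sample.

```python
def cpm_threshold(count_cutoff, lib_size):
    """CPM value equivalent to `count_cutoff` raw reads in a library of `lib_size` reads."""
    return count_cutoff / (lib_size / 1e6)

# Numbers from my data: a raw count cutoff of 10 and a smallest library of 1.7M reads.
threshold = cpm_threshold(10, 1_700_000)
print(round(threshold, 1))  # 5.9
```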
Thinking about this, it seems to me that a lot of data is wasted: there are bigger libraries in the dataset, but only the smallest library determines the cutoff for filtering. This means that in the larger libraries, genes with a count of >10 are still discarded. Would another reference point, for example the mean library size, not be a better basis for the cutoff?
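As a rough illustration of what worries me, this plain Python sketch (again not edgeR code; `counts_at_cpm` is a made-up helper) converts the CPM cutoff of 10/1.7 back to raw counts for my smallest and largest libraries.

```python
def counts_at_cpm(cpm, lib_size):
    """Raw read count that corresponds to `cpm` in a library of `lib_size` reads."""
    return cpm * lib_size / 1e6

cpm_cutoff = 10 / 1.7  # cutoff derived from the smallest library (1.7M reads)

smallest = counts_at_cpm(cpm_cutoff, 1_700_000)
largest = counts_at_cpm(cpm_cutoff, 7_900_000)
print(round(smallest, 1))  # 10.0
print(round(largest, 1))   # 46.5
```

So in the 7.9M library a gene needs roughly 46 reads in that sample to pass the same CPM cutoff.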
After this I want to keep genes with a CPM >5.9. In the manual of the edgeR package they keep a gene if its CPM is above the threshold in 2 or more samples, as they use 2 biological replicates. As I don't have any biological replicates, can I just select the genes with a CPM >5.9 in any of the samples?
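The two rules I am comparing can be sketched like this in plain Python. This is only an illustration with made-up CPM values and a hypothetical helper `keep_genes`, not what edgeR does internally.

```python
def keep_genes(cpm_by_gene, threshold, min_samples=1):
    """Return the genes whose CPM exceeds `threshold` in at least `min_samples` samples."""
    kept = []
    for gene, values in cpm_by_gene.items():
        n_above = sum(1 for v in values if v > threshold)
        if n_above >= min_samples:
            kept.append(gene)
    return kept

# Made-up CPM values for three hypothetical genes across four samples.
toy_cpm = {
    "geneA": [0.0, 1.2, 0.5, 2.0],
    "geneB": [6.5, 0.0, 0.3, 1.1],
    "geneC": [12.0, 9.4, 7.7, 15.3],
}

any_sample = keep_genes(toy_cpm, 5.9, min_samples=1)   # rule I am considering
two_samples = keep_genes(toy_cpm, 5.9, min_samples=2)  # rule as I read it in the manual
print(any_sample)   # ['geneB', 'geneC']
print(two_samples)  # ['geneC']
```

With the "any sample" rule, a gene like geneB is kept on the strength of a single sample, which is the part I am unsure about.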
Any guidance through the forest would be greatly appreciated!