Gene Expression Analysis Steps ?

Hi everybody, I’m new in this field. I’m trying to replicate a paper to train my self . The results come out pretty the same but not exactly the same, so I wanted to know if all my steps are right or if I’m missing something ( or even completely doing it wrong ). I did follow edgeR guide for help.

Those are the steps ( I jump not important code lines ) :

1 – Prior to computation of differentially expressed transcripts (‘ANOVA testing for differences’), genes with less than five (non-normalized) read counts in all samples were excluded from analyses.

counts <- counts[!apply(counts<5,1,all,na.rm=TRUE),]

2 – Next, data normalization factors were calculated that are incorporated in further analyses by the weighted trimmed mean of M-values (”TMM”) method.

list <- DGEList(counts=counts,group=group)
list <- calcNormFactors(list,method = "TMM")

3 – After fitting of negative binominal models and both common, tagwise and trended dispersion estimates were obtained, differentially expressed transcripts were determined using a generalized linear model (GLM) likelihood ratio test.

design <- model.matrix(~ 0 + group)
list <- estimateDisp(list,design,robust = TRUE)
fit <- glmQLFit(list,design,robust = TRUE)
single.contrast <- makeContrasts(Tumor  - HC, levels = design)
test <- glmQLFTest(fit,contrast = single.contrast)

4 – To minimize the effect of batch effect, resulting in possibly false positive results, ‘sequencing batch’ was included as a covariate in the GLM likelihood ratio test design matrix. DONT KNOW HOW TO DO

5 – To determine differentially expressed transcripts, we only focused on transcripts with expression levels of logarithmic counts per million (LogCPM) > 3.

tags <- topTags(test, n=nrow(test$table))$table
tags.logcpm <- tagsA[tags$logCPM>3,]

At this point they obtain 5003 genes , I do obtain 4986 instead. But the logCPM values come out the same , just differences between FDR values (not so much tho).

Important Question : They do mention another thing that is the following -> To determine the specific input gene lists for the classifying algorithms we performed ANOVA testing for differences (as implemented in the R-package edgeR), yielding classifier-specific gene lists. Here they instead retrive 1072 genes that are actually inside my 4986 but I dont really understand what is this ANOVA testing for differences , and on what they did this (step 1? or after step 5?). This ANOVA testing is made to obtain genes for SVM classification so I assume they wanted genes with major differences.

Read more here: Source link