Differential Gene Expression

Can you analyze in GEO2R? => No, because this is RNA-seq and not microarrays.

You are lucky thought that the authors seem to provide raw counts so you can easily fede them into DESeq2.
Here is a code suggestion, for details please read the DESeq2 vignette extensively, it contains answers to 99.99% of all questions that might arise: bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html

I also included some basic QC via Principal Component Analysis (PCA) and as you will see there is an obvious batch effect that needs to be corrected. Please go through the code, read the manual, try to understand what is going on. Be sure to really get a background.

This assumes that you downloaded the entire _RAW folder via the browser and then simply decompressed it by double-clicking on it, no magic here.



#/ download via the browser and uncompress the archive.
#/ we assume that the decompressed folder is in ~/Downloads/
listed_files <- list.files("~/Downloads/GSE162562_RAW", pattern=".gz", full.names=TRUE)

#/ load into R:
loaded <- lapply(listed_files, function(x) read.delim(x, header=FALSE, row.names = "V1"))

#/ bind everything into a single count matrix and assign colnames:
raw_counts <- do.call(cbind, loaded)
colnames(raw_counts) <- gsub(".txt.*", "", basename(listed_files))

#/ there is some nonsense in these files, probably due to the tool that was used (HTSeq?),
#/ such as rows with names "__no_feature", lets remove that:
raw_counts <- raw_counts[!grepl("^__", rownames(raw_counts)),]

#/ make a DESeq2 object. We can parse the group membership (Mild etc) from the colnames,
#/ which are based on the original filenames:
library(DESeq2)
dds <- DESeqDataSetFromMatrix(
  countData=raw_counts,
  colData=data.frame(group=factor(unlist(lapply(strsplit(colnames(raw_counts), split="_"), function(x) x[4])))),
  design=~group)

#/ some QC via PCA:
vsd <- vst(dds, blind=TRUE)
#/ plot the PCA:
plotPCA(vsd, "group")

#/ extract the data:
pcadata <- plotPCA(vsd, "group", returnData=TRUE)

#/ there is a very obvious batch effect, that we can correct, simply everything that is greater(less than zero in the first principal component, very obvious just by looking at the plot:
dds$batch <- factor(ifelse(pcadata$PC1 > 0, "batch1", "batch2"))

#/ lets use limma::removeBatchEffect to see whether it can be removed:
library(limma)
vsd2 <- vsd
assay(vsd2) <- removeBatchEffect(assay(vsd), batch=dds$batch)

#/ repeat PCA, looks much better. this means we modify our design to include batch and group,
#/ again, please see the vignette what this all means:
plotPCA(vsd2, "group")

design(dds) <- ~batch + group

#/ standard workflow, see vignette:
dds <- DESeq(dds)

#/ extract the results, here for example Mild vs Asymptom, using "ashr" shrinkage, see vignette:
library(ashr)
res_Mild_vs_Asymptom <- lfcShrink(dds=dds, contrast=c("group", "Mild", "Asymptom"), type="ashr")

This is the first PCA that reveals the batch effect:

enter image description here

…and the second one that demonstrates that the batch effect can be removed:

enter image description here

That is now a lot of code, but this is basically how you do a standard RNA-seq differential analysis.

Read more here: Source link