Analysing Microarray Data In Bioconductor

I was thinking about creating a tutorial on how to do a simple microarray analysis in Bioconductor. But, I realized this has already been done quite nicely at the Bioinformatics Knowledgeblog. Their first tutorial on the subject covers installation of necessary packages, downloading of cel files, describing the experiment, loading and normalizing data, quality controls, probe set filtering, finding differentially expressed probe sets, and finally annotating those probe sets to gene symbols. They have follow-up tutorials on visualizing microarray data with volcano plots and assessing the biological significance of results. All three seem quite excellent after a brief perusal.

I might have also included: filtering with ‘genefilter’, multiple testing correction with ‘multtest’, and visualizations with ‘heatmap.2’. Let me know if you would like to see tutorials specifically on any of those topics.

UPDATE: Since this seems to be coming up with some regularity I decided to add some sample code for importing CEL files, normalizing, and annotating to Entrez Gene ID and Symbol, even though this overlaps largely with the tutorial linked above. It illustrates a few slightly different things and also gives a concrete example for the ‘Affymetrix Human Gene 1.0 ST Array’.

#install the core bioconductor packages, if not already installed
source("http://bioconductor.org/biocLite.R")
biocLite()

# install additional bioconductor libraries, if not already installed
biocLite("GEOquery")
biocLite("affy")
biocLite("gcrma")
biocLite("hugene10stv1cdf")
biocLite("hugene10stv1probe")
biocLite("hugene10stprobeset.db")
biocLite("hugene10sttranscriptcluster.db")

#Load the necessary libraries
library(GEOquery)
library(affy)
library(gcrma)
library(hugene10stv1cdf)
library(hugene10stv1probe)
library(hugene10stprobeset.db)
library(hugene10sttranscriptcluster.db)

#Set working directory for download
setwd("/Users/ogriffit/Dropbox/BioStars")

#Download the CEL file package for this dataset (by GSE - Geo series id)
getGEOSuppFiles("GSE27447")

#Unpack the CEL files
setwd("/Users/ogriffit/Dropbox/BioStars/GSE27447")
untar("GSE27447_RAW.tar", exdir="data")
cels = list.files("data/", pattern = "CEL")
sapply(paste("data", cels, sep="https://www.biostars.org/"), gunzip)
cels = list.files("data/", pattern = "CEL")

setwd("/Users/ogriffit/Dropbox/BioStars/GSE27447/data")
raw.data=ReadAffy(verbose=TRUE, filenames=cels, cdfname="hugene10stv1") #From bioconductor

#perform RMA normalization (I would normally use GCRMA but it did not work with this chip)
data.rma.norm=rma(raw.data)

#Get the important stuff out of the data - the expression estimates for each array
rma=exprs(data.rma.norm)

#Format values to 5 decimal places
rma=format(rma, digits=5)

#Map probe sets to gene symbols or other annotations
#To see all available mappings for this platform
ls("package:hugene10stprobeset.db") #Annotations at the exon probeset level
ls("package:hugene10sttranscriptcluster.db") #Annotations at the transcript-cluster level (more gene-centric view)

#Extract probe ids, entrez symbols, and entrez ids
probes=row.names(rma)
Symbols = unlist(mget(probes, hugene10sttranscriptclusterSYMBOL, ifnotfound=NA))
Entrez_IDs = unlist(mget(probes, hugene10sttranscriptclusterENTREZID, ifnotfound=NA))

#Combine gene annotations with raw data
rma=cbind(probes,Symbols,Entrez_IDs,rma)

#Write RMA-normalized, mapped data to file
write.table(rma, file = "rma.txt", quote = FALSE, sep = "t", row.names = FALSE, col.names = TRUE)

This produces a tab-delimited text file of the following format. Note that many probes will have “NA” for gene symbol and Entrez ID.

probes Symbols Entrez_IDs GSM678364_B2.CEL GSM678365_B4.CEL GSM678366_B5.CEL …

7897441 H6PD 9563 6.5943 7.0552 7.5201 …

7897449 SPSB1 80176 6.9727 7.0281 7.2285 …

7897460 SLC25A33 84275 7.6659 7.4289 7.9707 …

Read more here: Source link