Difference between revisions of "Using Bioconductor To Analyse Beadarray Data"

From Bridges Lab Protocols
Jump to: navigation, search
m
m
 
(12 intermediate revisions by the same user not shown)
Line 1: Line 1:
 
[[Category: R]]
 
[[Category: R]]
 
[[Category: Bioinformatics]]
 
[[Category: Bioinformatics]]
 +
[[Category: Bioconductor]]
  
 
==Software Requirements==
 
==Software Requirements==
*R, get from [[http://cran.r-project.org/ CRAN]]
+
*R, get from [http://cran.r-project.org/ CRAN]
*Bioconductor, get from [[http://www.bioconductor.org/download Bioconductor]]
+
*Bioconductor, get from [http://www.bioconductor.org/download Bioconductor]
 
*Bioconductor packages.  Install as needed:
 
*Bioconductor packages.  Install as needed:
 
**beadarray
 
**beadarray
 
**limma
 
**limma
 +
**annotation data for the array (normally illuminaMousev2BeadID.db)
 +
*To install bioconductor packages use:
 
<pre>
 
<pre>
 
source("http://www.bioconductor.org/biocLite.R")
 
source("http://www.bioconductor.org/biocLite.R")
Line 47: Line 50:
 
*First define groups for each treatment.  If a samplesheet was provided correctly and had this information:
 
*First define groups for each treatment.  If a samplesheet was provided correctly and had this information:
 
<pre>samples = pData(BSData)$Sample_Group</pre>
 
<pre>samples = pData(BSData)$Sample_Group</pre>
*Otherwise define these groups manually in the order that they were entered (check by looking at pData(BSData)
+
*Otherwise define these groups manually in the order that they were entered, check by looking at pData(BSData)
 
<pre>samples = c("Control", "Control", "Treatment1", "Treatment1, "Treatment2"...)</pre>
 
<pre>samples = c("Control", "Control", "Treatment1", "Treatment1, "Treatment2"...)</pre>
 
*Next the groups are used to set up a statistical design:
 
*Next the groups are used to set up a statistical design:
Line 64: Line 67:
 
</pre>
 
</pre>
 
===Generating a Venn Diagram for Differential Expression===
 
===Generating a Venn Diagram for Differential Expression===
*First define a cutoff criteria for inclusion.
+
*First define a cutoff criteria for inclusion. One option is to use the decideTests function:
**One option is to use the decideTests function:
+
<pre>results = decideTests(ebFit)</pre>
</pre>results = decideTests(ebFit, method="global")</pre>
+
*The relevant options are for method and adjust.method
**The relevant options are for method and adjust.method
+
**method
***method
+
***default is "global", which allows for p-value comparasons
****default is "global", which allows for p-value comparasons
+
**adjust.method, this defines the false-discovery rate adjustment:
***adjust.method, this defines the false-discovery rate adjustment:
+
***default is "BH" for Benjami and Hochberg
****default is "BH" for Benjami and Hochberg
+
***other options are "none", "fdr" (same as BH), "holm" and "BY"
****other options are "none", "fdr" (same as BH), "holm" and "BY"
+
*Now use that classification to generate the Venn Diagram.  The following will include both up and downregulated genes and color the numbers accordingly:
 +
<pre>vennDiagram(results, include="both", col=c("red","green")</pre>
 +
 
 
==Annotation of Expression Sets and Fitted Data==
 
==Annotation of Expression Sets and Fitted Data==
 +
*To see all possible annotation criteria use:
 +
<pre>
 +
library(illuminaMousev2BeadID.db)
 +
illuminaMousev2BeadID()
 +
</pre>
 +
*Normally you want to annotate with at least the gene symbol and gene name.  Add other criteria as required
 +
<pre>
 +
ids = rownames(exprs(BSData))
 +
GeneName = mget(ids, illuminaMousev2BeadIDGENENAME, ifnotfound = NA)
 +
symbol = mget(ids, illuminaMousev2BeadIDSYMBOL, ifnotfound = NA)
 +
anno = cbind(GeneSymbol = as.character(symbol), GeneName = as.character(GeneName))
 +
</pre>
 +
*To add this annotation to the data analysis file:
 +
<pre>
 +
ebFit$genes = anno
 +
write.fit = (ebFit, file = "Filename.csv", adjust="BH")
 +
</pre>
 +
*This example includes a false discovery rate ("BH") adjusted p.value.
 +
*This function writes a tab-delimited text file containing for each gene (1) the average log-intensity, (2) the log-ratios, (3) moderated t-statistics, (4) t-statistic P-values, (5) F-statistic if available, (6) F-statistic P-values if available, (7) classification if available.
 +
*To add this annotation data to the expression set:
 +
<pre>
 +
data = exprs(BSData.quantile)
 +
data = cbind(anno,data)
 +
write.csv = (data, file = "Filename.csv")
 +
</pre>
 +
*Remember that the expression set is Log2 adjusted, so to look at absolute expression levels use 2^value.

Latest revision as of 15:34, 2 September 2009


Software Requirements

  • R, get from CRAN
  • Bioconductor, get from Bioconductor
  • Bioconductor packages. Install as needed:
    • beadarray
    • limma
    • annotation data for the array (normally illuminaMousev2BeadID.db)
  • To install bioconductor packages use:
source("http://www.bioconductor.org/biocLite.R")
biocLite("PACKAGE")

Loading Data

  • At a minimum you need the Probe Profile data (normally a txt file).
  • For all R procedures first change directory to your working directory then next create a new script, and save all executed lines in that script file.
  • Load the beadarray library, indictate dataFile (required), sampleSheet (normally a xls or csv file) and control set (Control Probe, normally a txt file)
data = "FinalReport_SampleProbe.txt"
controls = "ControlProbe.txt"
samplesheet = "Proj_54_12Aug09_WGGEX_SS_name.csv"
BSData = readBeadSummaryData(dataFile = data, qcFile= controls, sampleSheet=samplesheet)
  • You may need to alter either the ProbeID or ControlID to fit the illuminaprobe column from the sampleprobe or controlprobe datasets.
  • This fits the data into the BSData dataframe. Phenotype data can be accessed by pData(BSData) and expression data can be accessed by exprs(BSData).

Data Normalisation

  • Microarray data is typically quantile normalised and log2 transformed:
BSData.quantile = normaliseIllumina(BSData, method="quantile", transform="log2")
  • To examine the effects of normalisation on the dataset use boxplots:
boxplot(as.data.frame(log2(exprs(BSData))),las=2,outline=FALSE, ylab="Intensity (Log2 Scale)")
boxplot(as.data.frame(exprs(BSData.quantile)),las=2,outline=FALSE, ylab="Intensity (Log2 Scale)")
  • Save these boxplots as postscript files.

Clustering Analysis

  • This analysis will generate a euclidean distance matrix then a cluster analysis of that matrix and will show the distribution between replicates. Ideally similar treatments will cluster together.
d = dist(t(exprs(BSData.quantile)))
plot(hclust(d)

Differential Expression Analysis

  • Normalised data can be analysed using the limma package for statistical differences
  • First define groups for each treatment. If a samplesheet was provided correctly and had this information:
samples = pData(BSData)$Sample_Group
  • Otherwise define these groups manually in the order that they were entered, check by looking at pData(BSData)
samples = c("Control", "Control", "Treatment1", "Treatment1, "Treatment2"...)
  • Next the groups are used to set up a statistical design:
library(limma)
samples = as.factor(samples)
design = model.matrix(~0 + samples)
colnames(design) = levels(samples)
fit = lmFit(exprs(BSData.quantile), design)
  • Now set up contrast matrices to define how you want the data analyses. For example you may want to compare some treatments to a control, as well as between some treaments. See the limma user guide for more information about specific analyses. When defining the contrast matrix use the sample group names as defined above.
cont.matrix = makeContrasts(Treatment1vsControl = Treatment1 - Control, Treatment2vsControl = Treatment2 - Control, Treatment1vsTreatment2 = Treatment1 - Treatment2, levels = design)
fit.cont = contrasts.fit(fit, cont.matrix)
ebFit = eBayes(fit.cont)

Generating a Venn Diagram for Differential Expression

  • First define a cutoff criteria for inclusion. One option is to use the decideTests function:
results = decideTests(ebFit)
  • The relevant options are for method and adjust.method
    • method
      • default is "global", which allows for p-value comparasons
    • adjust.method, this defines the false-discovery rate adjustment:
      • default is "BH" for Benjami and Hochberg
      • other options are "none", "fdr" (same as BH), "holm" and "BY"
  • Now use that classification to generate the Venn Diagram. The following will include both up and downregulated genes and color the numbers accordingly:
vennDiagram(results, include="both", col=c("red","green")

Annotation of Expression Sets and Fitted Data

  • To see all possible annotation criteria use:
library(illuminaMousev2BeadID.db)
illuminaMousev2BeadID()
  • Normally you want to annotate with at least the gene symbol and gene name. Add other criteria as required
ids = rownames(exprs(BSData))
GeneName = mget(ids, illuminaMousev2BeadIDGENENAME, ifnotfound = NA)
symbol = mget(ids, illuminaMousev2BeadIDSYMBOL, ifnotfound = NA)
anno = cbind(GeneSymbol = as.character(symbol), GeneName = as.character(GeneName))
  • To add this annotation to the data analysis file:
ebFit$genes = anno
write.fit = (ebFit, file = "Filename.csv", adjust="BH")
  • This example includes a false discovery rate ("BH") adjusted p.value.
  • This function writes a tab-delimited text file containing for each gene (1) the average log-intensity, (2) the log-ratios, (3) moderated t-statistics, (4) t-statistic P-values, (5) F-statistic if available, (6) F-statistic P-values if available, (7) classification if available.
  • To add this annotation data to the expression set:
data = exprs(BSData.quantile)
data = cbind(anno,data)
write.csv = (data, file = "Filename.csv")
  • Remember that the expression set is Log2 adjusted, so to look at absolute expression levels use 2^value.