Difference between revisions of "Using Bioconductor To Analyse Beadarray Data"
From Bridges Lab Protocols
Davebridges (Talk | contribs) m |
Davebridges (Talk | contribs) m |
||
(17 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
[[Category: R]] | [[Category: R]] | ||
[[Category: Bioinformatics]] | [[Category: Bioinformatics]] | ||
+ | [[Category: Bioconductor]] | ||
==Software Requirements== | ==Software Requirements== | ||
− | *R, get from | + | *R, get from [http://cran.r-project.org/ CRAN] |
− | *Bioconductor, get from | + | *Bioconductor, get from [http://www.bioconductor.org/download Bioconductor] |
*Bioconductor packages. Install as needed: | *Bioconductor packages. Install as needed: | ||
**beadarray | **beadarray | ||
**limma | **limma | ||
+ | **annotation data for the array (normally illuminaMousev2BeadID.db) | ||
+ | *To install bioconductor packages use: | ||
<pre> | <pre> | ||
source("http://www.bioconductor.org/biocLite.R") | source("http://www.bioconductor.org/biocLite.R") | ||
Line 25: | Line 28: | ||
*You may need to alter either the ProbeID or ControlID to fit the illuminaprobe column from the sampleprobe or controlprobe datasets. | *You may need to alter either the ProbeID or ControlID to fit the illuminaprobe column from the sampleprobe or controlprobe datasets. | ||
*This fits the data into the BSData dataframe. Phenotype data can be accessed by pData(BSData) and expression data can be accessed by exprs(BSData). | *This fits the data into the BSData dataframe. Phenotype data can be accessed by pData(BSData) and expression data can be accessed by exprs(BSData). | ||
+ | |||
+ | ==Data Normalisation== | ||
+ | *Microarray data is typically quantile normalised and log2 transformed: | ||
+ | <pre>BSData.quantile = normaliseIllumina(BSData, method="quantile", transform="log2")</pre> | ||
+ | *To examine the effects of normalisation on the dataset use boxplots: | ||
+ | <pre> | ||
+ | boxplot(as.data.frame(log2(exprs(BSData))),las=2,outline=FALSE, ylab="Intensity (Log2 Scale)") | ||
+ | boxplot(as.data.frame(exprs(BSData.quantile)),las=2,outline=FALSE, ylab="Intensity (Log2 Scale)") | ||
+ | </pre> | ||
+ | *Save these boxplots as postscript files. | ||
+ | |||
+ | ==Clustering Analysis== | ||
+ | *This analysis will generate a euclidean distance matrix then a cluster analysis of that matrix and will show the distribution between replicates. Ideally similar treatments will cluster together. | ||
+ | <pre> | ||
+ | d = dist(t(exprs(BSData.quantile))) | ||
+ | plot(hclust(d) | ||
+ | </pre> | ||
+ | |||
+ | ==Differential Expression Analysis== | ||
+ | *Normalised data can be analysed using the limma package for statistical differences | ||
+ | *First define groups for each treatment. If a samplesheet was provided correctly and had this information: | ||
+ | <pre>samples = pData(BSData)$Sample_Group</pre> | ||
+ | *Otherwise define these groups manually in the order that they were entered, check by looking at pData(BSData) | ||
+ | <pre>samples = c("Control", "Control", "Treatment1", "Treatment1, "Treatment2"...)</pre> | ||
+ | *Next the groups are used to set up a statistical design: | ||
+ | <pre> | ||
+ | library(limma) | ||
+ | samples = as.factor(samples) | ||
+ | design = model.matrix(~0 + samples) | ||
+ | colnames(design) = levels(samples) | ||
+ | fit = lmFit(exprs(BSData.quantile), design) | ||
+ | </pre> | ||
+ | *Now set up contrast matrices to define how you want the data analyses. For example you may want to compare some treatments to a control, as well as between some treaments. See the limma user guide for more information about specific analyses. When defining the contrast matrix use the sample group names as defined above. | ||
+ | <pre> | ||
+ | cont.matrix = makeContrasts(Treatment1vsControl = Treatment1 - Control, Treatment2vsControl = Treatment2 - Control, Treatment1vsTreatment2 = Treatment1 - Treatment2, levels = design) | ||
+ | fit.cont = contrasts.fit(fit, cont.matrix) | ||
+ | ebFit = eBayes(fit.cont) | ||
+ | </pre> | ||
+ | ===Generating a Venn Diagram for Differential Expression=== | ||
+ | *First define a cutoff criteria for inclusion. One option is to use the decideTests function: | ||
+ | <pre>results = decideTests(ebFit)</pre> | ||
+ | *The relevant options are for method and adjust.method | ||
+ | **method | ||
+ | ***default is "global", which allows for p-value comparasons | ||
+ | **adjust.method, this defines the false-discovery rate adjustment: | ||
+ | ***default is "BH" for Benjami and Hochberg | ||
+ | ***other options are "none", "fdr" (same as BH), "holm" and "BY" | ||
+ | *Now use that classification to generate the Venn Diagram. The following will include both up and downregulated genes and color the numbers accordingly: | ||
+ | <pre>vennDiagram(results, include="both", col=c("red","green")</pre> | ||
+ | |||
+ | ==Annotation of Expression Sets and Fitted Data== | ||
+ | *To see all possible annotation criteria use: | ||
+ | <pre> | ||
+ | library(illuminaMousev2BeadID.db) | ||
+ | illuminaMousev2BeadID() | ||
+ | </pre> | ||
+ | *Normally you want to annotate with at least the gene symbol and gene name. Add other criteria as required | ||
+ | <pre> | ||
+ | ids = rownames(exprs(BSData)) | ||
+ | GeneName = mget(ids, illuminaMousev2BeadIDGENENAME, ifnotfound = NA) | ||
+ | symbol = mget(ids, illuminaMousev2BeadIDSYMBOL, ifnotfound = NA) | ||
+ | anno = cbind(GeneSymbol = as.character(symbol), GeneName = as.character(GeneName)) | ||
+ | </pre> | ||
+ | *To add this annotation to the data analysis file: | ||
+ | <pre> | ||
+ | ebFit$genes = anno | ||
+ | write.fit = (ebFit, file = "Filename.csv", adjust="BH") | ||
+ | </pre> | ||
+ | *This example includes a false discovery rate ("BH") adjusted p.value. | ||
+ | *This function writes a tab-delimited text file containing for each gene (1) the average log-intensity, (2) the log-ratios, (3) moderated t-statistics, (4) t-statistic P-values, (5) F-statistic if available, (6) F-statistic P-values if available, (7) classification if available. | ||
+ | *To add this annotation data to the expression set: | ||
+ | <pre> | ||
+ | data = exprs(BSData.quantile) | ||
+ | data = cbind(anno,data) | ||
+ | write.csv = (data, file = "Filename.csv") | ||
+ | </pre> | ||
+ | *Remember that the expression set is Log2 adjusted, so to look at absolute expression levels use 2^value. |
Latest revision as of 15:34, 2 September 2009
Contents
Software Requirements
- R, get from CRAN
- Bioconductor, get from Bioconductor
- Bioconductor packages. Install as needed:
- beadarray
- limma
- annotation data for the array (normally illuminaMousev2BeadID.db)
- To install bioconductor packages use:
source("http://www.bioconductor.org/biocLite.R") biocLite("PACKAGE")
Loading Data
- At a minimum you need the Probe Profile data (normally a txt file).
- For all R procedures first change directory to your working directory then next create a new script, and save all executed lines in that script file.
- Load the beadarray library, indictate dataFile (required), sampleSheet (normally a xls or csv file) and control set (Control Probe, normally a txt file)
data = "FinalReport_SampleProbe.txt" controls = "ControlProbe.txt" samplesheet = "Proj_54_12Aug09_WGGEX_SS_name.csv" BSData = readBeadSummaryData(dataFile = data, qcFile= controls, sampleSheet=samplesheet)
- You may need to alter either the ProbeID or ControlID to fit the illuminaprobe column from the sampleprobe or controlprobe datasets.
- This fits the data into the BSData dataframe. Phenotype data can be accessed by pData(BSData) and expression data can be accessed by exprs(BSData).
Data Normalisation
- Microarray data is typically quantile normalised and log2 transformed:
BSData.quantile = normaliseIllumina(BSData, method="quantile", transform="log2")
- To examine the effects of normalisation on the dataset use boxplots:
boxplot(as.data.frame(log2(exprs(BSData))),las=2,outline=FALSE, ylab="Intensity (Log2 Scale)") boxplot(as.data.frame(exprs(BSData.quantile)),las=2,outline=FALSE, ylab="Intensity (Log2 Scale)")
- Save these boxplots as postscript files.
Clustering Analysis
- This analysis will generate a euclidean distance matrix then a cluster analysis of that matrix and will show the distribution between replicates. Ideally similar treatments will cluster together.
d = dist(t(exprs(BSData.quantile))) plot(hclust(d)
Differential Expression Analysis
- Normalised data can be analysed using the limma package for statistical differences
- First define groups for each treatment. If a samplesheet was provided correctly and had this information:
samples = pData(BSData)$Sample_Group
- Otherwise define these groups manually in the order that they were entered, check by looking at pData(BSData)
samples = c("Control", "Control", "Treatment1", "Treatment1, "Treatment2"...)
- Next the groups are used to set up a statistical design:
library(limma) samples = as.factor(samples) design = model.matrix(~0 + samples) colnames(design) = levels(samples) fit = lmFit(exprs(BSData.quantile), design)
- Now set up contrast matrices to define how you want the data analyses. For example you may want to compare some treatments to a control, as well as between some treaments. See the limma user guide for more information about specific analyses. When defining the contrast matrix use the sample group names as defined above.
cont.matrix = makeContrasts(Treatment1vsControl = Treatment1 - Control, Treatment2vsControl = Treatment2 - Control, Treatment1vsTreatment2 = Treatment1 - Treatment2, levels = design) fit.cont = contrasts.fit(fit, cont.matrix) ebFit = eBayes(fit.cont)
Generating a Venn Diagram for Differential Expression
- First define a cutoff criteria for inclusion. One option is to use the decideTests function:
results = decideTests(ebFit)
- The relevant options are for method and adjust.method
- method
- default is "global", which allows for p-value comparasons
- adjust.method, this defines the false-discovery rate adjustment:
- default is "BH" for Benjami and Hochberg
- other options are "none", "fdr" (same as BH), "holm" and "BY"
- method
- Now use that classification to generate the Venn Diagram. The following will include both up and downregulated genes and color the numbers accordingly:
vennDiagram(results, include="both", col=c("red","green")
Annotation of Expression Sets and Fitted Data
- To see all possible annotation criteria use:
library(illuminaMousev2BeadID.db) illuminaMousev2BeadID()
- Normally you want to annotate with at least the gene symbol and gene name. Add other criteria as required
ids = rownames(exprs(BSData)) GeneName = mget(ids, illuminaMousev2BeadIDGENENAME, ifnotfound = NA) symbol = mget(ids, illuminaMousev2BeadIDSYMBOL, ifnotfound = NA) anno = cbind(GeneSymbol = as.character(symbol), GeneName = as.character(GeneName))
- To add this annotation to the data analysis file:
ebFit$genes = anno write.fit = (ebFit, file = "Filename.csv", adjust="BH")
- This example includes a false discovery rate ("BH") adjusted p.value.
- This function writes a tab-delimited text file containing for each gene (1) the average log-intensity, (2) the log-ratios, (3) moderated t-statistics, (4) t-statistic P-values, (5) F-statistic if available, (6) F-statistic P-values if available, (7) classification if available.
- To add this annotation data to the expression set:
data = exprs(BSData.quantile) data = cbind(anno,data) write.csv = (data, file = "Filename.csv")
- Remember that the expression set is Log2 adjusted, so to look at absolute expression levels use 2^value.