User:Osborne

Revision as of 10:22, 30 August 2006 by Osborne

Basic Microarray Analysis with geWorkbench

Getting Started

  1. Requirements (Windows/Mac/Linux OS, Java 1.5 installed, at least 512 MB RAM)
  2. Installation
    1. geWorkbench downloads
    2. Java downloads

Background

Correct and complete microarray analysis requires an understanding of both the actual experiment and the statistical and mathematical tools being used. The tools and techniques will vary depending on the type of experiment and on what the user hopes to learn from it. Here we describe how to analyze one of the most common types of microarray experiments: differential gene expression on Affymetrix arrays. Most of the techniques described should be suitable for other types of analysis when appropriately modified, but the user is cautioned against applying them blindly to their own data.

Introduction

This tutorial walks the user through a fairly typical microarray experiment done using the Affymetrix HGU133Plus2 platform. In this case the experiment is a study of multiple myeloma resistance. It investigates 3 cell lines established from a patient resistant to glucocorticoids. The 3 cell lines are:

  1. MM.1S expresses mostly the normal receptor. (C2E3)
  2. MM.Re expresses a small amount of normal receptor, but more alternatively spliced receptor (which is non-functional) (P1414)
  3. MM.1Rl expresses very little receptor of any kind. (P1310)

Cell Line History

The goal is to get a baseline measurement of the difference in expression between the three cell lines. No large difference is expected, because they all originate from the same ancestor. Splicing differences are expected.

Step 1 - Inspect your data

The first step is to visually inspect at least one microarray. As of version 1.04, geWorkbench can read the Affymetrix CEL file format natively. However, due to data structure incompatibilities, it cannot yet do anything other than display the data unless the data is pre-processed using R (geWorkbench can directly access an R server). The ability to see an image of the microarray is still useful, because it is worthwhile to make sure there are no obvious errors (streaking, etc.) on the microarray you are about to analyze. CEL files can be loaded by selecting type 'CEL' in the file loader. You should see an image like the one shown below.


CEL File Image


This image looks good, so we can continue. Inspect all of your images for defects if you want to be diligent. Skipping this check is not usually a problem, but it may be worth doing if you run into problems later.


Step 2 - Data Preparation

The next step is to prepare the data for loading. Affymetrix data is best imported into geWorkbench as tab-delimited files containing 3 columns: the 1st column is the probeset identifier, the 2nd an optional annotation, and the 3rd the signal data. This file format is referred to in geWorkbench as "Affymetrix File Matrix" format, and to be recognized by the file loading component of geWorkbench the filename should end with .exp. This is not to be confused with the Affymetrix .exp (experiment) file, which is *not* loaded by geWorkbench. While this file cannot be generated directly by Affymetrix software, the CEL files or spreadsheets that Affymetrix software generates can be converted into this format using a spreadsheet program such as Excel.
  1. Generation from a spreadsheet
The easiest way to get data from a small number of microarray files is to open each file in a spreadsheet program and modify it there. Below is a result for a typical experiment.


Initial spreadsheet file

The user should modify the spreadsheet to get it to look like the picture below. The 1st column title should be AffyID, the 2nd Annotation and the 3rd the name of the Microarray being analyzed.


Modified spreadsheet file

It should then be saved in tab-delimited format with an .exp extension. This needs to be done for ALL microarrays in the experiment. If you are dealing with dozens or even hundreds of files, try using R to generate the files instead.
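For larger numbers of files the same three-column layout can also be produced by a script. The following is a minimal Python sketch, not part of geWorkbench, that writes rows in the layout described above (header AffyID, Annotation, then the array name). The function name write_exp, the array name C2E3_1 and the signal values are made up for illustration.

```python
import csv
import io


def write_exp(rows, array_name, out):
    """Write (probeset_id, annotation, signal) rows as a tab-delimited table.

    The header follows the layout described in this tutorial:
    AffyID, Annotation, <name of the microarray>.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["AffyID", "Annotation", array_name])
    for probeset_id, annotation, signal in rows:
        writer.writerow([probeset_id, annotation, signal])


# Illustrative rows only: the signal values and the array name are invented.
rows = [("1007_s_at", "DDR1", 1234.5), ("1053_at", "RFC2", 210.0)]
buffer = io.StringIO()
write_exp(rows, "C2E3_1", buffer)
exp_text = buffer.getvalue()
print(exp_text)
```

To produce a real file, pass a file object opened with open("C2E3_1.exp", "w", newline="") instead of the in-memory buffer, and repeat for each microarray.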


  2. From CEL files
FIXME (obsolete with 1.04)


Starting the Application

On Windows click on Start -> Program -> geWorkbench 1.04
We have had no success running geWorkbench on Mac or Linux.

Load the data

  1. From the top part of the menu click on File -> New -> Project
  2. File -> Open
    1. Use the shift button to select all 9 files
    2. Click on the checkbox for merge files
    3. Select 'OK'. The loading dialog box is shown below.


MicroarrayAnalysisTutorialLoadingMergedSet.JPG

    1. Select the appropriate chip type; the most common Affy file types are provided. If your file type is missing you will have to add the library files (downloaded from Affy's website) to the geWorkbench directory (FIXME - need more info)

    1. With 9 chips from the HGU133Plus2 platform it will take a couple of minutes to load. When loading is done, check that each of the chips was loaded successfully with the correct number. One quick way to do this is to look at the Microarray Tabular Viewer as shown below.


MicroarrayAnalysisTutorialLoadedDataTabularMicroView.JPG
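The same sanity check can be made on the input files before loading. The sketch below is an assumption-laden illustration, not a geWorkbench feature: it counts the data rows of each tab-delimited table and reports whether all arrays agree. The helper names count_probesets and check_counts are hypothetical.

```python
def count_probesets(lines):
    """Count data rows (non-blank lines after the header) in one table."""
    data = [line for line in lines if line.strip()]
    return max(len(data) - 1, 0)


def check_counts(tables):
    """Return per-array row counts and whether all arrays have the same count.

    tables maps an array name to an iterable of text lines, for example an
    open file object for that array's .exp file.
    """
    counts = {name: count_probesets(lines) for name, lines in tables.items()}
    consistent = len(set(counts.values())) <= 1
    return counts, consistent


# Tiny in-memory example with made-up rows; the second array is one row short.
example = {
    "array_1": ["AffyID\tAnnotation\tarray_1", "p1\t\t10.0", "p2\t\t20.0"],
    "array_2": ["AffyID\tAnnotation\tarray_2", "p1\t\t11.0", ""],
}
counts, consistent = check_counts(example)
print(counts, consistent)
```

A mismatch in the counts usually means a file was truncated or exported differently from the others.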


Log 2 Transform (optional)

If your data has not been log2 transformed (our test data hasn't), it is a good idea to do so, because the distribution is closer to normal on the log2 scale. To do this, follow the steps below.

  1. Select Normalizer from the bottom right hand portion of the screen
  2. Select Log2 Transformation
  3. Click on the Normalize button


MicroArrayAnalysis TutorialLog2TransformPreNormalization.JPG
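For readers who want to see what the transformation does to the numbers, here is a minimal Python sketch. It is not geWorkbench's implementation. The floor parameter is an assumption added so that zero or negative signals do not cause an error; how geWorkbench treats such values is not covered here.

```python
import math


def log2_transform(signals, floor=1.0):
    """Return log2 of each signal, raising values below `floor` to `floor`.

    The floor is an illustrative choice (assumption), since log2 is
    undefined for zero and negative values.
    """
    return [math.log2(max(signal, floor)) for signal in signals]


# Made-up signal values spanning three orders of magnitude.
raw = [1, 2, 8, 1024]
print(log2_transform(raw))
```

On the log2 scale a doubling of signal is always a difference of 1, which is why fold changes are usually reported on this scale.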

Normalize

We will use quantile normalization to ensure the same expression value distribution across all the microarrays. To do so we will:

  1. Select Normalizer from the bottom right hand portion of the screen
  2. Select Quantile Normalization
  3. Select Mean Profiling to handle any missing values.
  4. Click on the Normalize button in the bottom right

A picture is shown below with the normalization window and the results of the normalization in the microarray tabular viewer.


MicroarrayAnalysisTutorialLog2TransThenNormalized.JPG


This can be verified by looking at the Expression Value Distribution plot, since quantile normalization should normalize expression across all microarrays. Click on the expression value distribution tab to verify that all the sets follow the same pattern.

MicroarrayTutorialAnalysisEVDPostNormalization.JPG
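The idea behind quantile normalization can be sketched in a few lines of Python. This is a simplified illustration and not geWorkbench's code: it assumes every array has the same probes and no missing values (so it does not reproduce the Mean Profiling option), and ties are broken by position rather than averaged.

```python
def quantile_normalize(arrays):
    """Give every array the same value distribution.

    arrays: list of equal-length lists, one list of signals per microarray.
    Each value is replaced by the mean, across arrays, of the values that
    share its rank.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Mean of the k-th smallest value across all arrays.
    rank_means = [
        sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)
    ]
    normalized = []
    for a in arrays:
        # Probe indices ordered from smallest to largest signal.
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = rank_means[rank]
        normalized.append(out)
    return normalized


# Two tiny made-up arrays with three probes each.
result = quantile_normalize([[5, 2, 3], [4, 1, 6]])
print(result)
```

After normalization every array contains exactly the same set of values, only assigned to different probes, which is why the distribution plots of all arrays should coincide.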


Classification - Setting up controls and cases

In order for geWorkbench to understand what the experiment is about, we need to set up our experimental classes. We have 3 different groups of data: wild type expression with C2E3, altered expression with P1.414 and low expression with P1.310. To classify our groups do the following:

  1. Use Shift and the left mouse button to select all 3 microarrays with P1.310 in the selection panel
  2. Right mouse click on this P1.310 set
  3. Select Classification -> Case to assign this set as a test case. Call it "Low".
  4. Repeat with the set of 3 microarrays for P1.414, setting them as a case as well. Call this set "Altered".

We do not have to do anything to specify C2E3 as a control, because this is the default option. At the end the classification


T-Test - Finding differentially expressed genes

We are going to do a simple T-Test to locate genes that are differentially expressed. A T-Test compares only 2 groups, so we will start by looking at differences between the control and the altered gene expression of P1.414.


  1. Do not repeat the classification for C2E3; its default classification is control, and it is the control
  2. Select only the control C2E3 dataset (wildtype expression) and one of the cases (illustrated below is P1.414)
  3. Select T-Test
  4. Select correction method (Adjusted Bonferroni)
  5. Select Analyze (results should finish in a few seconds)


The results are below.


TTestVolcanoAlteredvsNormal.JPG


By mousing over the dots you can identify the individual probes responsible for that value. To get a more comprehensive list, investigate the color mosaic by selecting the Color Mosaic tabbed pane in the upper middle portion of the main panel. Results are shown below.


MicroarrayAnalysisTutorialColorMosaicTTestNormalvsAltered.JPG
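To make the test itself less of a black box, here is a minimal Python sketch of a two-sample T-Test for one probe, followed by a Bonferroni cutoff. Several assumptions apply: it uses the equal-variance (Student) form, which may not be the variant geWorkbench runs; the p-value is obtained by numerically integrating the t density; and the correction shown is the plain Bonferroni rule, which may differ from the "Adjusted Bonferroni" option in the application. All function names are made up for illustration.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson integration of the density on [0, |t|]."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    area = total * h / 3
    return max(0.0, 1 - 2 * area)


def student_t_test(a, b):
    """Equal-variance two-sample t-test; returns (t statistic, p-value).

    Fails with a division by zero if both groups have zero variance.
    """
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, two_sided_p(t, df)


def bonferroni_significant(p_values, alpha=0.05):
    """Plain Bonferroni: a test is significant if p <= alpha / number of tests."""
    cutoff = alpha / len(p_values)
    return [p <= cutoff for p in p_values]


# Made-up log2 signals for one probe: 3 control arrays vs 3 case arrays.
t_stat, p_value = student_t_test([1, 2, 3], [2, 3, 4])
print(t_stat, p_value)
```

With tens of thousands of probesets on an array, the Bonferroni cutoff becomes very small, so only probes with very strong differences survive the correction.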


Ideally, because we have 3 separate classes of data, we would like to do an ANOVA test to find our significant genes. geWorkbench does not support ANOVA, but it does make it easy to run multiple T-Tests. To do this:

  1. Select the "Analysis" tab
  2. Select multi-T-Test
  3. Select critical p-value (I left it at the default of 0.1) (FIXME, is this corrected?)
  4. Select the comparison panels, in this case the wildtype expression of C2E3, the altered expression of P1.414, and the low expression of P1.310
  5. Select Analyze
    1. 3 separate T-Tests should be run
    2. T-Test results for all 3 appear in the project window
    3. Select the Markers panel in the selection panel (not Arrays/Phenotypes). It will list the significant genes for each of the T-Tests. In this case we have 131, 161 and 178 genes for each of the 3 combinations as seen in the picture below. We can use this to do further analysis.


MicroarrayAnalysisTutorialMultiTTest.JPG
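The reason 3 classes give 3 T-Tests is that every pair of classes is compared once. A short Python sketch of that enumeration follows. The class labels are taken from this tutorial, while the function name pairwise_comparisons is made up for illustration.

```python
from itertools import combinations


def pairwise_comparisons(classes):
    """Return every unordered pair of classes, one pair per T-Test."""
    return list(combinations(classes, 2))


classes = ["C2E3 (control)", "P1.414 (Altered)", "P1.310 (Low)"]
pairs = pairwise_comparisons(classes)
for first, second in pairs:
    print(first, "vs", second)
```

In general k classes produce k(k-1)/2 pairwise tests, so the number of comparisons grows quickly as classes are added.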

GO Analysis

Promoter Analysis

Pattern Discovery

  1. Other Analysis