Tutorial - Filtering and Normalizing

Revision as of 19:05, 21 December 2006 by Smith (Performing a Housekeeping Gene Normalization)



Outline

This tutorial will cover:

  • Uses of filtering and normalization
  • Types of filtering and normalization in geWorkbench
  • Example 1 - Combined normalization and filtering
  • Example 2 - Using the Affymetrix Present/Absent/Marginal call filter
  • Example 3 - Normalization using housekeeping genes

Overview

Filtering can be used to remove low-quality data or to reduce the size of the dataset by removing less interesting data. Normalization can be used to decrease the effects of systematic differences across a set of microarrays, allowing better cross-microarray comparisons.

In geWorkbench, filtering and normalization alter the loaded dataset; the original is not retained. The effect of filtering out a value is to mark it internally as "Missing". Many types of analysis require that missing values be dealt with properly. The Missing Values Filter allows markers that have more than a specified number of missing values to be removed. Another option is to use the Missing Value Computation found in Normalizers, which will replace missing values with imputed values. Some analysis routines have built-in methods for replacing missing values. Normalization results in the replacement of existing data values with new values.
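The two ways of dealing with missing values described above can be sketched in a few lines of Python. This is only an illustration of the idea, not geWorkbench's own code: the dataset layout (a dictionary of marker profiles), the use of None to stand in for "Missing", and the function names are all assumptions made for this example.

```python
from statistics import mean

# Rows are markers, columns are microarrays; None stands in for "Missing".
data = {
    "m1": [5.0, None, 7.0],
    "m2": [None, None, 3.0],
    "m3": [1.0, 2.0, 3.0],
}

def missing_values_filter(dataset, max_missing):
    """Keep only markers with at most max_missing missing values."""
    return {
        marker: values
        for marker, values in dataset.items()
        if sum(v is None for v in values) <= max_missing
    }

def impute_marker_mean(dataset):
    """Replace each missing value with the mean of that marker's present values."""
    result = {}
    for marker, values in dataset.items():
        present = [v for v in values if v is not None]
        fill = mean(present) if present else None
        result[marker] = [fill if v is None else v for v in values]
    return result

filtered = missing_values_filter(data, max_missing=1)  # drops m2 (2 missing)
imputed = impute_marker_mean(data)                     # fills gaps, keeps all markers
```

The first function shrinks the dataset, while the second keeps every marker but replaces the gaps with estimates; which is appropriate depends on the downstream analysis.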

Methods such as RMA and GCRMA, which include a normalization step, are not yet directly supported in geWorkbench. Affymetrix CEL files can instead be processed outside geWorkbench, using a program such as RMAExpress (available for Windows computers) or R/Bioconductor, and the results then imported into geWorkbench.

A short introduction to various methods of Affymetrix data preparation is available at Affymetrix Preprocessing.

Filters

geWorkbench comes with the following filters installed:


Filter | Description
Affy Detection Call | Applicable to Affymetrix data only. Marks as missing all measurements whose detection call is in a user-selected combination of P, A or M (Present, Absent, Marginal).
Missing values | Discards all markers that have “missing” measurements in more than N microarrays, where the maximum N is set by the user. Missing values can arise either from the original data or from the results of another filtering step.
Deviation | Marks as missing every value of any marker whose deviation, measured across all microarrays, is less than a given value.
Expression Threshold | Marks as missing all expression values that are inside (or outside) a user-defined range. For example, when "inside range" is selected, all expression values between the minimum and maximum values given will be marked missing.
2 Channel | Applicable to 2-channel array (GenePix) data only. Defines applicable ranges for each channel, and marks as missing all expression measurements for which either channel intensity is inside (or outside) the defined range.
GenePix Flags | Removes values flagged in the GenePix software.
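As an illustration of how a value-level filter marks data, the Expression Threshold filter's "inside range" and "outside range" modes might be sketched as follows. The function name, the dictionary layout, the use of None for "Missing", and the choice to treat the range boundaries as inclusive are all assumptions for this example; geWorkbench's own handling of boundary values is not specified here.

```python
def expression_threshold_filter(dataset, low, high, inside=True):
    """Mark values inside (or outside) the range [low, high] as missing (None)."""
    def is_hit(value):
        in_range = low <= value <= high  # inclusive bounds are an assumption
        return in_range if inside else not in_range

    return {
        marker: [None if v is not None and is_hit(v) else v for v in values]
        for marker, values in dataset.items()
    }

# Hypothetical expression values; None is an already-missing measurement.
data = {"m1": [1.0, 5.0, 9.0], "m2": [4.0, None, 6.0]}

inside_removed = expression_threshold_filter(data, 4.0, 6.0, inside=True)
outside_removed = expression_threshold_filter(data, 4.0, 6.0, inside=False)
```

Note that the markers themselves are still present after this step; only individual values have been marked, which is why a Missing Values filter pass is typically run afterwards.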

Normalizers

geWorkbench comes with the following normalization routines installed:


Normalizer | Description
Missing value computation | Replaces every missing value with either the mean value of that marker across all microarrays or the mean measurement of all markers in the microarray where the missing value is observed.
Log2 Transformation | Applies a log2 transformation to all measurements in a microarray.
Threshold Normalizer | All data points whose value is less than (or greater than) a user-specified minimum (maximum) value are raised (reduced) to that minimum (maximum) value.
Marker-based centering | Subtracts the mean (median) measurement of a marker profile from every measurement in the profile.
Array-based centering | Subtracts the mean (median) measurement of a microarray from every measurement in that microarray.
Mean-variance normalizer | For every marker profile, the mean measurement of the entire profile is subtracted from each measurement in the profile and the resulting value is divided by the standard deviation of the profile.
House-keeping genes normalizer | Normalizes all values such that the averaged expression value of specified house-keeping markers is the same on each microarray.
Quantile Normalizer | Adjusts expression values so that the distribution of values is the same on each microarray, though which marker has which value varies.
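Two of these normalizers can be made concrete with a small sketch. The code below shows the textbook form of quantile normalization (each array's k-th smallest value is replaced by the mean of the k-th smallest values across all arrays) and of mean-variance normalization. It assumes no missing values and no ties, uses the sample standard deviation, and the function names are hypothetical; it is not taken from the geWorkbench source.

```python
from statistics import mean, stdev

def quantile_normalize(arrays):
    """arrays: list of equal-length lists, one list per microarray."""
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the k-th smallest values across arrays.
    reference = [mean(s[k] for s in sorted_arrays) for k in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # marker indices, low to high
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]  # marker keeps its rank, gets the shared value
        normalized.append(out)
    return normalized

def mean_variance_normalize(profile):
    """Subtract the profile mean and divide by the (sample) standard deviation."""
    m, s = mean(profile), stdev(profile)
    return [(v - m) / s for v in profile]

arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
qn = quantile_normalize(arrays)
z = mean_variance_normalize([1.0, 5.0, 9.0])
```

After quantile normalization both arrays contain exactly the same set of values (here 1.5, 3.5 and 5.5), but each marker retains its original rank within its array.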


Preparation

These examples use the microarray dataset file webmatrix2.exp, available in Download, which contains unfiltered, unnormalized data. Please refer to the Tutorial - Projects and Data Files page for assistance in loading a file.


Example 1: Normalization followed by filtering

This example depicts several typical normalization and filtering steps. These steps will recreate the dataset "webmatrix_quantile_log2_dev1.2_mv0.exp", available as part of the tutorials data download.


  • In the Normalization component, select Quantile Normalization. The image below shows a view of the dataset prior to normalization.
  • Click Normalize.


T Normalization Quantile.png


  • Next, select Log2 Transformation.
  • Click Normalize.


T Normalization Log2.png


  • In the Filtering component, select the Deviation filter.
  • The picture below was obtained using a deviation bound of 1.0. However, to recreate the current example data set, please use a deviation bound of 1.2. Markers whose deviation, measured across all arrays, is less than this bound will have every value marked missing.
  • Click Filter.
  • Missing values are displayed in yellow in the Microarray Viewer, as in the result shown below.


T Filtering Deviation.png


  • Select the Missing Values filter. A cutoff of zero can be used because the Deviation filter marked each affected marker as missing on every array.
  • Click Filter. This will remove all the markers whose values have been marked missing.

T Filtering MissingValuesResult.png


Using these settings, 2,226 of the original 12,600 markers remain.
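The logic of the last three steps of this example can be sketched on toy data. The quantile step is omitted for brevity, the values are hypothetical, and "deviation" is assumed here to mean the sample standard deviation of a marker across arrays, which may differ from the exact statistic geWorkbench computes.

```python
import math
from statistics import stdev

# Toy linear-scale data (hypothetical values): marker -> one value per array.
data = {
    "steady":   [100.0, 110.0, 95.0, 105.0],
    "variable": [16.0, 256.0, 64.0, 1024.0],
}

# Log2 transformation of every measurement.
logged = {m: [math.log2(v) for v in vals] for m, vals in data.items()}

# Deviation filter: markers below the bound get every value marked missing (None).
BOUND = 1.2
flagged = {
    m: vals if stdev(vals) >= BOUND else [None] * len(vals)
    for m, vals in logged.items()
}

# Missing Values filter with a cutoff of 0: drop markers with any missing value.
kept = {m: vals for m, vals in flagged.items() if vals.count(None) == 0}
```

Because the deviation filter marks a low-variation marker as missing on every array, a cutoff of 0 in the final step is enough to remove exactly those markers.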

Example 2: Filtering out data called absent in an Affymetrix file

  • In the Filtering Panel, select Affy Detection Call Filter.

Filterpanel.gif


  • Select the ‘A’ (Absent) checkbox and press Filter. Values that were called Absent in the original dataset are highlighted in yellow in the Microarray Viewer, as in the example above. Internally, the values are now marked as Missing.
  • In the Filtering component, select Missing Values Filter.
  • Choose the maximum number of arrays that can have missing values before a marker is removed (the default is 0).
  • Click Filter. Markers with more than 0 missing values are removed.
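The effect of these steps can be sketched as follows, assuming each marker carries one expression value and one detection call (P, A or M) per array. The data layout, the function name, and the values are hypothetical and serve only to illustrate the two-stage behaviour: values are first marked missing, and markers are removed only in the second pass.

```python
def detection_call_filter(values, calls, remove=("A",)):
    """Mark as missing (None) every value whose detection call is in `remove`."""
    return [None if call in remove else v for v, call in zip(values, calls)]

# marker -> (expression values, detection calls), one entry per array
data = {
    "m1": ([10.0, 12.0, 11.0], ["P", "P", "P"]),
    "m2": ([3.0, 50.0, 4.0], ["A", "P", "M"]),
}

# Stage 1: values called Absent are marked missing; all markers are still present.
marked = {m: detection_call_filter(v, c) for m, (v, c) in data.items()}

# Stage 2: Missing Values filter with the default maximum of 0 missing values.
max_missing = 0
remaining = [m for m, vals in marked.items() if vals.count(None) <= max_missing]
```

Raising `max_missing` to 1 in this toy example would keep m2, since only one of its three values was called Absent.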


Example 3: Normalization using Housekeeping genes

geWorkbench provides a component for using a set of housekeeping genes to scale the expression values of each array. A set of probe/gene IDs is read in, and the expression levels of these genes are averaged.

Note on use of log transformed data

1. The use of the Housekeeping Gene Normalization component is offered as a convenience in geWorkbench. It may or may not be suitable for a given platform or type of experiment. Other algorithms for such normalizations are available; see Affymetrix Preprocessing.

2. We suggest performing housekeeping gene normalization on log2 transformed data rather than linear scale data, because of the tremendous range of values often seen in the linear data. Averaging the log values is similar to taking a geometric mean. (Affymetrix MAS5 software averages the logs of individual probe values to calculate the final probeset value for a given chip.)
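The relationship between averaging log values and the geometric mean can be checked numerically. The values below are hypothetical; the point is only that raising 2 to the mean of the log2 values gives the geometric mean, which is far less dominated by a single large value than the arithmetic mean.

```python
import math
from statistics import mean

values = [2.0, 8.0, 32.0]  # hypothetical linear-scale expression values

mean_of_logs = mean(math.log2(v) for v in values)        # (1 + 3 + 5) / 3
geometric_mean = math.prod(values) ** (1 / len(values))  # cube root of 512
arithmetic_mean = mean(values)                           # pulled up by 32.0

# Back-transforming the mean of the logs recovers the geometric mean.
same = math.isclose(2 ** mean_of_logs, geometric_mean)
```

Here the geometric mean is 8 while the arithmetic mean is 14, showing how the largest value dominates an average taken on the linear scale.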



Performing a Housekeeping Gene Normalization

  1. Read in the tutorial data file "webmatrix2.exp"
  2. Perform a log2 transformation of the data in the Normalization component
  3. Select the Housekeeping Genes normalizer.
  4. Read in a set of housekeeping gene candidates. A small file with 5 such probes is found in the geWorkbench installation, under the data directory. The file is named "housekeeping_markers.csv".
  5. The list of genes will appear in the right-side box labeled current selected genes. If you would like to exclude one, select it and use the left-pointing arrow to move it to the "Excluded Genes" box. The result is shown in the figure below.
  6. Push "Normalize" to perform the operation. The data file is transformed. Remember, the original data in the Project Folders component is overwritten; this operation is not reversible.


T Housekeeping Genes Normalizer.png

Housekeeping Gene normalizer.
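One plausible reading of this normalization on log2 data is sketched below: each array is shifted by a constant so that the mean of its housekeeping markers matches a common target. The choice of target (the mean of the per-array housekeeping means), the additive shift on the log scale, the marker names, and the function name are all assumptions for illustration; the exact calculation used by geWorkbench may differ.

```python
from statistics import mean

def housekeeping_normalize(dataset, housekeeping):
    """Shift each array so its housekeeping-marker mean equals a common target.

    dataset: marker -> list of log2 values, one per array.
    housekeeping: names of the markers used as the reference set.
    """
    n_arrays = len(next(iter(dataset.values())))
    # Average of the housekeeping markers on each array.
    hk_means = [mean(dataset[m][j] for m in housekeeping) for j in range(n_arrays)]
    target = mean(hk_means)  # assumed target: mean of the per-array averages
    offsets = [target - h for h in hk_means]
    return {
        marker: [v + offsets[j] for j, v in enumerate(values)]
        for marker, values in dataset.items()
    }

# Hypothetical log2 data for two arrays.
log_data = {
    "hk1":   [8.0, 9.0],
    "hk2":   [10.0, 11.0],
    "geneX": [5.0, 7.0],
}
normalized = housekeeping_normalize(log_data, ["hk1", "hk2"])
```

An additive shift on the log2 scale corresponds to multiplying every value on an array by a constant factor on the linear scale, which is why this step is described as scaling each array.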