Difference between revisions of "T-test"

(Volcano Plot Visualizer)
(Fold Change)
 
(39 intermediate revisions by 2 users not shown)
Latest revision as of 15:02, 13 March 2015




Overview

For the geWorkbench web version of T-Test please see T-Test_web.

A t-Test analysis can be used to identify markers with statistically significant differential expression between two sets of microarrays. In geWorkbench, these groups are specified as the "Case" and "Control" sets.

There are several steps to setting up a t-test analysis in geWorkbench.

  1. At least two sets of arrays must be available in the Arrays component.
  2. The array sets to be used in the analysis must be "activated" by checking the box adjacent to their names in the Arrays component.
  3. One or more activated array sets must be designated "Case", and the others "Control" (which is the default classification).
  4. The t-test parameters must be set.


After the t-test is run, the results will be displayed graphically, and all markers meeting the significance threshold are placed into a new Marker Set called "Significant Genes".

Please see the Example section below for instructions on preparing array sets for the t-test analysis.

As of geWorkbench 2.4.0, the t-test result is calculated using the Apache Commons Math Library.

t-Test Parameters

P-value Parameters

p-values based on

The p-values can be calculated by transforming the t-statistic directly, or by carrying out a permutation analysis. The permutation analysis measures how often a t-statistic at least as large as that observed occurs by chance after array labels of case and control are permuted.

  • t-distribution (the default)
  • Permutation - If chosen, the number of permutations to carry out must also be specified.
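The permutation option can be illustrated with a small standard-library sketch. It is not the geWorkbench implementation: the function names `welch_t` and `permutation_p_value`, the two-sided comparison on the absolute t-statistic, and the fixed random seed are all assumptions made for illustration.

```python
import math
import random
from statistics import mean, variance


def welch_t(case, control):
    """Welch t-statistic for two independent samples (unequal variances)."""
    diff = mean(case) - mean(control)
    se = math.sqrt(variance(case) / len(case) + variance(control) / len(control))
    if se == 0:
        # Degenerate case: no spread in either group.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def permutation_p_value(case, control, n_permutations=1000, seed=0):
    """Fraction of label shuffles whose |t| is at least the observed |t|.

    Assumption: a two-sided test with random (not exhaustive) permutations.
    """
    rng = random.Random(seed)
    observed = abs(welch_t(case, control))
    pooled = list(case) + list(control)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # permute the case/control labels
        perm_case = pooled[:len(case)]
        perm_control = pooled[len(case):]
        if abs(welch_t(perm_case, perm_control)) >= observed:
            hits += 1
    return hits / n_permutations
```

With more permutations the estimate becomes more stable, at the cost of running time; the smallest non-zero p-value that can be resolved is 1 divided by the number of permutations.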

Overall alpha (Critical p-value)

The threshold for a difference in expression between Case and Control sets being called significant. A value of 0.05 is often used for a single test. Multiple-testing corrections can be specified in the Alpha Corrections tab.

Data is Log2-transformed

If the dataset has been Log2 transformed, check this box. Having this information allows the fold-change displayed in the Volcano Plot to be calculated in a consistent fashion.

The system will examine the current dataset and make a guess as to whether the data has been log2 transformed. The user can override this guess using the check box.


T-test Pvalue params.png

Alpha corrections

Several methods for correcting for the effects of performing multiple tests are offered, including Bonferroni and False Discovery Rate control. They differ in how they compare the calculated p-value to the cutoff value of alpha - the critical p-value for determining the significance of an observed difference.

  • Just Alpha - No correction is performed.
  • Standard Bonferroni - The cutoff value (alpha) is divided by the number of tests (genes) before being compared with the calculated p-values.
  • Adjusted (step-down) Bonferroni - Similar to the Bonferroni correction, but the p-values are sorted in increasing order and, for each successive p-value, the divisor for alpha is decremented by one before the result is compared with that p-value. The effect is to slightly reduce the stringency (increase the power) of the Bonferroni correction. This is a step-down procedure.
  • Two variants of the Westfall and Young method are available if the p-value is estimated by permutation:
    • minP
    • maxT
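As a rough illustration of how the two Bonferroni options differ, the sketch below applies both to a list of p-values. The function names are hypothetical, and the use of "less than or equal" and of stopping at the first non-significant p-value in the step-down variant are assumptions; geWorkbench may differ in these details.

```python
def bonferroni_significant(p_values, alpha):
    """Standard Bonferroni: compare every p-value with alpha / number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def step_down_bonferroni_significant(p_values, alpha):
    """Step-down Bonferroni: the divisor shrinks by one for each successive p-value."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            significant[i] = True
        else:
            break  # assumed step-down rule: stop at the first failure
    return significant


p_values = [0.001, 0.012, 0.02, 0.04]
print(bonferroni_significant(p_values, 0.05))            # [True, True, False, False]
print(step_down_bonferroni_significant(p_values, 0.05))  # [True, True, True, True]
```

In this example the standard correction compares every p-value with 0.0125, whereas the step-down variant uses the thresholds 0.0125, 0.0167, 0.025 and 0.05 in turn, which is why it accepts more markers.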


T-test alpha corrections.png

Degrees of Freedom

Group variances can be declared as:

  1. unequal (Welch approximation) (default)
  2. equal
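The two choices lead to different degrees of freedom for the t-distribution. The sketch below uses the textbook formulas (Welch-Satterthwaite for unequal variances, n1 + n2 - 2 for equal variances); the function names are hypothetical and this is not taken from the geWorkbench source.

```python
from statistics import variance


def welch_df(case, control):
    """Welch-Satterthwaite approximation of the degrees of freedom."""
    n1, n2 = len(case), len(control)
    v1 = variance(case) / n1     # squared standard error of the case mean
    v2 = variance(control) / n2  # squared standard error of the control mean
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))


def pooled_df(case, control):
    """Degrees of freedom when the group variances are assumed equal."""
    return len(case) + len(control) - 2
```

The Welch value is never larger than the pooled value, so the unequal-variance setting is the more conservative of the two when the group variances really do differ.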


T-test degrees of freedom.png

Example

Preparation

This example uses the file Bcell-100.zip, which is also contained in the data/public_data directory of the geWorkbench distribution and is further described in the Tutorial Data area. A version with the threshold normalizer and log2 transformations described below already applied is also available there.

You may also wish to load the Affymetrix HG-U95Av2 annotation file, although it is not required for this example. See the FAQ section for information on downloading this file from Affymetrix.


For tips on loading data files, see Local Data Files and Workspace.

In this example, we apply two normalization steps to the data set.

  1. Threshold Normalizer - set a minimum value of 1. Any value less than 1 will be set to 1.
  2. Log2 Transformation Normalizer - Log2 transform the data.

For an actual data analysis, you should apply data normalization steps appropriate to your own data and analysis design.
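The effect of the two normalization steps on a list of expression values can be sketched as follows. The function names are hypothetical and only mimic what the geWorkbench normalizers are described as doing above.

```python
import math


def threshold_normalize(values, minimum=1.0):
    """Raise every value below `minimum` up to `minimum`."""
    return [max(v, minimum) for v in values]


def log2_transform(values):
    """Log2-transform values that are already known to be positive."""
    return [math.log2(v) for v in values]


raw = [0.5, 1, 8, -3]
print(log2_transform(threshold_normalize(raw)))  # [0.0, 0.0, 3.0, 0.0]
```

Applying the threshold first guarantees that every value is at least 1, so the log2 transform is always defined and never negative.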

Array Classification

The t-test in geWorkbench requires that at least two sets of arrays be "activated". Only such "activated" sets are considered. In addition, at least one such set must be designated as "Case", and at least one other as "Control" (which is the default classification). Note that more than one set of arrays can be marked as "Case" or as "Control".

Array set classification is covered in the Arrays/Phenotypes chapter. However, for convenience, the steps are illustrated here.

The desired sets of arrays should be activated in the Arrays/Phenotypes component. This is done by checking the boxes by the desired Sets.

T-test Set activation BCell.png


The classification can be made directly by left-clicking on the "thumb-tack" icon adjacent to an array set name.

T-test Set classification left click Bcell.png


The array classification can also be set by right-clicking on the desired array set and selecting "Classification":


T-test Set classification right click Bcell.png


Using either method, the desired array set can be classified as "Case":


T-test Set selection BCell.png


The thumbtack image next to activated Array Sets is colored red.

Setting the Analysis Parameters

  1. In this example, prior to the t-test, the BCell-100.exp data was threshold normalized to a minimum value of 1, then log2 normalized.
  2. The t-test component should be loaded by default in the Component Configuration Manager.
  3. From the Analysis Panel, select T-Test Analysis.
  4. P-value Parameters tab:
    1. P-values based on t-distribution.
    2. Note that here the default alpha (critical p-value) is set to 0.01.
    3. If the data has been log2 transformed, check the box "Data is log2 Transformed".
  5. Alpha-corrections tab
    1. Standard Bonferroni
  6. Degrees of Freedom tab
    1. Welch approximation - unequal group variances.


The P-value Parameters tab set for the example analysis:


T-test Example setup.png

Running the t-test analysis

  1. Click Analyze. The results will be returned in three locations: the Workspace, the Markers component, and the Visualization area.

t-Test Results

Result Sets

A t-test result node is placed into the Workspace as a child of the microarray dataset that was analyzed.

The list of significant markers is placed into a new set in the Markers component. This set is labeled "Significant Genes". The number in square brackets indicates the number of markers in the set.

Saving a Result Set

The result node (in the Workspace) can be saved by right-clicking on it and selecting "Save". This will save the significant markers in a CSV (comma separated value) file with the following columns:

  • Probe Set Name
  • Gene Name
  • p-Value
  • Fold Change (Log2)

Markers found significant in the t-test, but for which a fold change value could not be calculated, are included in the export file as "N/A".
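The layout of the exported file can be sketched with Python's csv module. The column headings are those listed above; the function name, the probe and gene names, and the use of `None` to represent a missing fold change are assumptions for illustration only.

```python
import csv
import io


def write_results(rows, out):
    """Write (probe, gene, p_value, log2_fold_change) rows as CSV.

    A fold change of None (could not be calculated) is written as "N/A".
    """
    writer = csv.writer(out)
    writer.writerow(["Probe Set Name", "Gene Name", "p-Value", "Fold Change (Log2)"])
    for probe, gene, p_value, fold_change in rows:
        writer.writerow([probe, gene, p_value,
                         "N/A" if fold_change is None else fold_change])


# Hypothetical example rows, written to an in-memory buffer.
buffer = io.StringIO()
write_results([("probe_1", "GENE1", 0.001, 1.5),
               ("probe_2", "GENE2", 0.002, None)], buffer)
print(buffer.getvalue())
```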

Fold Change

The method used to calculate fold change depends on whether the data was marked as log2 transformed or not during the t-test using the "Data is log2 transformed" box.

  • Linear data ("Data is log2 transformed" box was not checked): for each marker, the fold change is calculated as the log2 transform of the average expression in the Case set divided by the average expression in the Control set, that is,
      Log2(Avg(cases)/Avg(controls))
    or, equivalently,
      Log2(Avg(cases)) - Log2(Avg(controls))   (difference of logs of averaged values).
  • Log2 transformed data ("Data is log2 transformed" box was checked): for each marker, the average of the (log) case values minus the average of the (log) control values is calculated, that is,
      Avg(cases) - Avg(controls)   (difference of averaged log values).

The fold change is not calculated if, for the linear case, the average case or control value is negative.
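The two fold-change rules can be written as one small function. This is a sketch based only on the description above: the function name and the `data_is_log2` flag are hypothetical, and treating an average of exactly zero like a negative one (because its logarithm is undefined) is an assumption.

```python
import math
from statistics import mean


def log2_fold_change(case, control, data_is_log2):
    """Return the log2 fold change, or None if it cannot be calculated."""
    if data_is_log2:
        # Values are already logs: difference of averaged log values.
        return mean(case) - mean(control)
    avg_case, avg_control = mean(case), mean(control)
    if avg_case <= 0 or avg_control <= 0:
        # Assumption: zero is rejected as well as negative averages.
        return None
    # Linear data: log of the ratio of averaged values.
    return math.log2(avg_case / avg_control)


print(log2_fold_change([8, 8], [2, 2], data_is_log2=False))          # 2.0
print(log2_fold_change([3.0, 3.0], [1.0, 1.0], data_is_log2=True))   # 2.0
print(log2_fold_change([-1, -2], [2, 2], data_is_log2=False))        # None
```

Note that the two rules are not numerically identical for the same underlying data: the average of logs is generally not the log of the average, which is why the check box needs to reflect how the data was actually prepared.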

Volcano Plot Visualizer

Volcano plot.png


The Volcano Plot graphically depicts the results of the t-test for differential expression. It includes only markers which exceeded the threshold for significance in the t-test. The log2 fold change for each marker is plotted against the -log10 of the P-value.

Markers for which no valid fold-change value could be calculated (e.g. for the case of linear data the average of the case or control values was negative) are omitted from the Volcano Plot. However, all such markers are included if the data is exported to file.
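The coordinates plotted for each marker can be sketched as below. The function name and the tuple layout are hypothetical, a missing fold change is represented by `None`, the p-values are assumed to be strictly positive, and placing the fold change on the X axis is an assumption.

```python
import math


def volcano_points(results):
    """Convert (marker, p_value, log2_fold_change) tuples to plot coordinates.

    Markers without a valid fold change (None) are left out of the plot.
    """
    points = []
    for marker, p_value, log2_fc in results:
        if log2_fc is None:
            continue  # omitted from the plot, but still present in exports
        points.append((marker, log2_fc, -math.log10(p_value)))
    return points
```

Because of the -log10 scaling, smaller p-values appear higher on the plot; a p-value of 0.001, for example, is drawn at a height of about 3.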

See the Volcano Plot tutorial for further details.

Technical Notes

  • If two data points have exactly the same coordinates, only the point which is "on top" will be shown when clicked on or moused over.
  • If the graph has only one point, or has several points all with the exact same coordinates, the default JFreeChart graphing behavior may omit a scale on the X or Y axis. The ranges of the axes and the labels can be manually adjusted. Right-click on the X or Y axis label area and select Properties->Range. Turn off "auto-ranging" and set the desired ranges.

Color Mosaic Visualizer

The Color Mosaic tab shows all arrays (or activated sets of arrays) and each significant marker with its p-value. By default, the markers are sorted by p-value. The display of each type of annotation can be switched on and off.

Please see the Color Mosaic tutorial for a complete description of all the features and controls of this viewer.

In the figure below, all display options have been activated, displaying array names at top, and p-value, accession (marker name), and gene name at left. The light-bulb indicates a hover text for the cell pointed to by the red arrow. The hover text displays the array (Chip) name, the marker name, and the signal value.


T-Test Color Mosaic Control Descriptions.png