T-test

Latest revision as of 15:02, 13 March 2015




Overview

For the geWorkbench web version of T-Test please see T-Test_web.

A t-Test analysis can be used to identify markers with statistically significant differential expression between two sets of microarrays. In geWorkbench, these groups are specified as the "Case" and "Control" sets.

There are several steps to setting up a t-test analysis in geWorkbench.

  1. At least two sets of arrays must be available in the Arrays component.
  2. The array sets to be used in the analysis must be "activated" by checking the box adjacent to their names in the Arrays component.
  3. One or more activated array sets must be designated "Case", and the others "Control" (which is the default classification).
  4. The t-test parameters must be set.


After the t-test is run, the results will be displayed graphically, and all markers meeting the significance threshold are placed into a new Marker Set called "Significant Genes".

Please see the Example section below for instructions on preparing array sets for the t-test analysis.

As of geWorkbench 2.4.0, the t-test result is calculated using the Apache Commons Math Library.

t-Test Parameters

P-value Parameters

p-values based on

The p-values can be calculated by transforming the t-statistic directly, or by carrying out a permutation analysis. The permutation analysis measures how often a t-statistic at least as large as that observed occurs by chance after array labels of case and control are permuted.

  • t-distribution (the default)
  • Permutation - If chosen, the number of permutations to carry out must also be specified.
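The permutation approach can be illustrated with a minimal Python sketch for a single marker. This is not the geWorkbench implementation: the function names are hypothetical, and for the sake of a reproducible result the sketch enumerates every possible relabeling of the arrays rather than drawing a user-specified number of random permutations.

```python
import itertools
import math
import statistics


def welch_t(case, control):
    """Welch t-statistic for two lists of expression values."""
    var_case = statistics.variance(case)
    var_control = statistics.variance(control)
    standard_error = math.sqrt(var_case / len(case) + var_control / len(control))
    return (statistics.mean(case) - statistics.mean(control)) / standard_error


def permutation_p_value(case, control):
    """Exact two-sided permutation p-value for one marker.

    Enumerates every way of relabeling the arrays as case or control and
    counts how often |t| is at least as large as the observed |t|.
    Assumes no relabeling leaves both groups with zero variance.
    """
    observed = abs(welch_t(case, control))
    pooled = list(case) + list(control)
    indices = range(len(pooled))
    hits = 0
    total = 0
    for case_idx in itertools.combinations(indices, len(case)):
        chosen = set(case_idx)
        perm_case = [pooled[i] for i in indices if i in chosen]
        perm_control = [pooled[i] for i in indices if i not in chosen]
        total += 1
        # Small tolerance so that ties with the observed value are counted.
        if abs(welch_t(perm_case, perm_control)) >= observed - 1e-12:
            hits += 1
    return hits / total
```

With three case values (9.1, 8.7, 9.4) and three control values (6.2, 6.8, 6.5) there are 20 relabelings, and only the observed labeling and its mirror image reach the observed |t|, so the function returns 0.1. With so few arrays the p-value cannot drop below 0.1, which shows why permutation-based p-values need reasonably large array sets.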

Overall alpha (Critical p-value)

Alpha is the threshold below which a difference in expression between the Case and Control sets is called significant. A value of 0.05 is often used for a single test. Multiple-testing corrections can be specified in the Alpha Corrections tab.

Data is Log2-transformed

If the dataset has been Log2 transformed, check this box. Having this information allows the fold-change displayed in the Volcano Plot to be calculated in a consistent fashion.

The system will examine the current dataset and make a guess as to whether the data has been log2 transformed. The user can override this guess using the check box.


T-test Pvalue params.png

Alpha corrections

Several methods for correcting for the effects of performing multiple tests are offered, including Bonferroni-type corrections and, for permutation-based p-values, the Westfall and Young methods. They differ in how they compare the calculated p-value to the cutoff value of alpha - the critical p-value for determining the significance of an observed difference.

  • Just Alpha - No correction is performed.
  • Standard Bonferroni - The cutoff value (alpha) is divided by the number of tests (genes) before being compared with the calculated p-values.
  • Adjusted (step-down) Bonferroni - Similar to the Bonferroni correction, but the p-values are first sorted in increasing order. The smallest p-value is compared with alpha divided by the number of tests, and for each successive p-value the divisor is decremented by one before the comparison is made. The effect is to slightly reduce the stringency (increase the power) relative to the standard Bonferroni correction.
  • Two variants of the Westfall and Young method are available if the p-value is estimated by permutation:
    • minP
    • maxT
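The difference between the two Bonferroni options can be sketched in a few lines of Python. The function names are hypothetical and the code is an illustration, not geWorkbench's own. It assumes the usual step-down convention that testing stops at the first p-value that misses its cutoff, and that a p-value equal to the cutoff counts as significant.

```python
def bonferroni(p_values, alpha):
    """Standard Bonferroni: compare every p-value with alpha / n."""
    cutoff = alpha / len(p_values)
    return [p <= cutoff for p in p_values]


def step_down_bonferroni(p_values, alpha):
    """Step-down Bonferroni: the divisor shrinks by one at each rank.

    Assumption: testing stops at the first p-value that misses its cutoff,
    so that marker and all markers with larger p-values are not significant.
    """
    n = len(p_values)
    # Indices of the p-values, from smallest to largest p-value.
    order = sorted(range(n), key=lambda i: p_values[i])
    significant = [False] * n
    for rank, i in enumerate(order):
        if p_values[i] > alpha / (n - rank):
            break
        significant[i] = True
    return significant
```

For the p-values 0.001, 0.012, 0.02 and 0.04 with alpha = 0.05, the standard correction compares all four against 0.0125 and accepts only the first two, while the step-down version uses the cutoffs 0.0125, 0.0167, 0.025 and 0.05 in turn and accepts all four.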


T-test alpha corrections.png

Degrees of Freedom

Group variances can be declared as:

  1. Unequal (Welch approximation), the default.
  2. Equal.
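The choice affects the degrees of freedom used when the t-statistic is converted to a p-value. As a rough sketch in Python (hypothetical function names, using the textbook formulas rather than geWorkbench code), the equal-variance case uses the pooled degrees of freedom, while the Welch approximation uses the Welch-Satterthwaite estimate:

```python
import statistics


def pooled_df(case, control):
    """Degrees of freedom when the group variances are assumed equal."""
    return len(case) + len(control) - 2


def welch_df(case, control):
    """Welch-Satterthwaite degrees of freedom for unequal group variances."""
    a = statistics.variance(case) / len(case)
    b = statistics.variance(control) / len(control)
    return (a + b) ** 2 / (a ** 2 / (len(case) - 1) + b ** 2 / (len(control) - 1))
```

When both groups have the same size and the same sample variance the two values coincide. When the variances differ, the Welch value is smaller, which makes the test more conservative.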


T-test degrees of freedom.png

Example

Preparation

This example uses the file Bcell-100.zip, which is also contained in the data/public_data directory of the geWorkbench distribution and is further described in the Tutorial Data area. A version with the threshold normalizer and log2 transformations described below already applied is also available there.

You may also wish to load the Affymetrix HG-U95Av2 annotation file, although it is not required for this example. See the FAQ section for information on downloading this file from Affymetrix.


For tips on loading data files, see Local Data Files and Workspace.

In this example, we apply two normalization steps to the data set.

  1. Threshold Normalizer - set a minimum value of 1. Any value less than 1 will be set to 1.
  2. Log2 Transformation Normalizer - Log2 transform the data.
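The combined effect of these two steps on a list of expression values can be sketched as follows. This is an illustrative Python function with a hypothetical name, not the geWorkbench normalizer code.

```python
import math


def threshold_then_log2(values, minimum=1.0):
    """Raise values below `minimum` to `minimum`, then take log2 of each."""
    return [math.log2(max(v, minimum)) for v in values]
```

For example, the input values 0.25, 1, 8 and -3 become 0.0, 0.0, 3.0 and 0.0. Setting the floor to 1 before the transformation guarantees that no zero or negative value reaches the logarithm, and that all transformed values are non-negative.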

For an actual data analysis, you should apply data normalization steps appropriate to your own data and analysis design.

Array Classification

The t-test in geWorkbench requires that at least two sets of arrays be "activated". Only activated sets are considered. In addition, at least one such set must be designated as "Case", and at least one other as "Control" (which is the default classification). Note that more than one set of arrays can be marked as "Case" or as "Control".

Array set classification is covered in the Arrays/Phenotypes chapter. However, for convenience, the steps are illustrated here.

The desired sets of arrays should be activated in the Arrays/Phenotypes component. This is done by checking the boxes by the desired Sets.

T-test Set activation BCell.png


The classification can be made directly by left-clicking on the "thumb-tack" icon adjacent to an array set name.

T-test Set classification left click Bcell.png


The array classification can also be set by right-clicking on the desired array set and selecting "Classification":


T-test Set classification right click Bcell.png


Using either method, the desired array set can be classified as "Case":


T-test Set selection BCell.png


The thumbtack image next to activated Array Sets is colored red.

Setting the Analysis Parameters

  1. In this example, prior to the t-test, the BCell-100.exp data was threshold normalized to a minimum value of 1, then log2 transformed.
  2. The t-test component should be loaded by default in the Component Configuration Manager.
  3. From the Analysis Panel, select T-Test Analysis.
  4. P-value Parameters tab:
    1. P-values based on t-distribution.
    2. Note that here the default alpha (critical p-value) is set to 0.01.
    3. If the data has been log2 transformed, check the box "Data is log2 Transformed".
  5. Alpha-corrections tab
    1. Standard Bonferroni
  6. Degrees of Freedom tab
    1. Welch approximation - unequal group variances.


The P-value Parameters tab set for the example analysis:


T-test Example setup.png

Running the t-test analysis

  1. Click Analyze. The results will be returned in three locations: The Workspace, the Markers component, and the Visualization area.

t-Test Results

Result Sets

A t-test result node is placed into the Workspace as a child of the microarray dataset that was analyzed.

The list of significant markers is placed into a new set in the Markers component. This set is labeled "Significant Genes". The number in square brackets indicates the number of markers in the set.

Saving a Result Set

The result node (in the Workspace) can be saved by right-clicking on it and selecting "Save". This will save the significant markers in a CSV (comma separated value) file with the following columns:

  • Probe Set Name
  • Gene Name
  • p-Value
  • Fold Change (Log2)

Markers found significant in the t-test, but for which a fold change value could not be calculated, are still included in the export file, with the fold change given as "N/A".
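The layout of such a file can be sketched with Python's csv module. The function name, probe set names and gene names below are made up for illustration, and the exact header text written by geWorkbench may differ from the column names listed above.

```python
import csv
import io


def write_significant_markers(rows, handle):
    """Write (probe_set, gene, p_value, log2_fold_change) tuples as CSV.

    A fold change of None (not calculable) is written as "N/A".
    """
    writer = csv.writer(handle)
    writer.writerow(["Probe Set Name", "Gene Name", "p-Value", "Fold Change (Log2)"])
    for probe_set, gene, p_value, fold_change in rows:
        shown = "N/A" if fold_change is None else fold_change
        writer.writerow([probe_set, gene, p_value, shown])


# Hypothetical example rows; the second marker has no valid fold change.
buffer = io.StringIO()
write_significant_markers(
    [("probe_A", "GENE1", 0.0004, 1.8), ("probe_B", "GENE2", 0.0009, None)],
    buffer,
)
print(buffer.getvalue())
```

Any tool that reads this file back should treat "N/A" as a missing value rather than as a number.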

Fold Change

The method used to calculate fold change depends on whether the data was marked as log2 transformed or not during the t-test using the "Data is log2 transformed" box.

  • Linear data ("Data is log2 transformed" box was not checked): for each marker, the fold change is the log2 of the average expression in the Case set divided by the average expression in the Control set, that is,
      Log2(Avg(cases) / Avg(controls))
    or, equivalently,
      Log2(Avg(cases)) - Log2(Avg(controls))   (difference of logs of averaged values).
  • Log2 transformed data ("Data is log2 transformed" box was checked): for each marker, the fold change is the average of the (log) case values minus the average of the (log) control values, that is,
      Avg(cases) - Avg(controls)   (difference of averaged log values).

The fold change is not calculated if, for the linear case, the average case or control value is negative.
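The two calculations can be written as one small Python function. This is a sketch with a hypothetical name, not the geWorkbench code. It returns None where no fold change can be calculated, and it assumes that an average of exactly zero is treated the same way as a negative one, since the logarithm is undefined there as well.

```python
import math
import statistics


def log2_fold_change(case, control, is_log2):
    """Return the log2 fold change for one marker, or None if undefined."""
    mean_case = statistics.mean(case)
    mean_control = statistics.mean(control)
    if is_log2:
        # Data already on a log2 scale: difference of averaged log values.
        return mean_case - mean_control
    # Linear data: the log of the ratio needs strictly positive averages
    # (treating zero like a negative average is an assumption).
    if mean_case <= 0 or mean_control <= 0:
        return None
    return math.log2(mean_case / mean_control)
```

Linear case values averaging 8 against control values averaging 2 give a log2 fold change of 2.0, the same result as log2 data averaging 3 against 1. The two methods do not generally agree on the same underlying data, however, because the log of an average is not the average of the logs. This is why the "Data is log2 transformed" box should match the actual state of the dataset.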

Volcano Plot Visualizer

Volcano plot.png


The Volcano Plot graphically depicts the results of the t-test for differential expression. It includes only markers that met the significance threshold in the t-test. For each marker, the log2 fold change is plotted against the -log10 of the p-value.

Markers for which no valid fold-change value could be calculated (e.g. for the case of linear data the average of the case or control values was negative) are omitted from the Volcano Plot. However, all such markers are included if the data is exported to file.
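How the plot coordinates relate to the t-test output can be sketched as follows. The function name and the tuple layout are hypothetical, the log2 fold change is placed first in each coordinate pair purely for illustration, and the code assumes every p-value is greater than zero.

```python
import math


def volcano_points(markers):
    """Map (name, p_value, log2_fold_change) tuples to plot coordinates.

    Markers whose fold change is None are skipped, mirroring their
    omission from the plot. Assumes every p-value is greater than zero.
    """
    points = []
    for name, p_value, fold_change in markers:
        if fold_change is None:
            continue
        points.append((name, fold_change, -math.log10(p_value)))
    return points
```

A marker with a p-value of 0.001 lands at a height of 3 on the -log10 axis, so the most significant markers appear at the top of the plot and the largest expression changes toward the left and right edges.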

See the Volcano Plot tutorial for further details.

Technical Notes

  • If two data points have exactly the same coordinates, only the point that is "on top" will be shown when clicked or hovered over.
  • If the graph has only one point, or several points that all have exactly the same coordinates, the default JFreeChart graphing behavior may omit the scale on the X or Y axis. The axis ranges and labels can be adjusted manually: right-click on the X or Y axis label area, select Properties->Range, turn off "auto-ranging" and set the desired ranges.

Color Mosaic Visualizer

The Color Mosaic tab shows all arrays (or activated sets of arrays) and each significant marker with its p-value. By default, the markers are sorted by p-value. The display of each type of annotation can be switched on and off.

Please see the Color Mosaic tutorial for a complete description of all the features and controls of this viewer.

In the figure below, all display options have been activated, displaying array names at top, and p-value, accession (marker name), and gene name at left. The light-bulb indicates a hover text for the cell pointed to by the red arrow. The hover text displays the array (Chip) name, the marker name, and the signal value.


T-Test Color Mosaic Control Descriptions.png