Difference between revisions of "ANOVA"
Revision as of 12:54, 6 October 2009
Overview
The geWorkbench ANOVA component utilizes a one-way analysis of variance calculation from TIGR's MEV package. At least three groups of arrays must be specified by defining and activating them in the Arrays/Phenotypes component. For each chosen marker the routine determines if, at the specified level of significance, any difference exists in expression values between any of the groups (the null hypothesis is that there is no difference between the groups). Several basic methods of multiple testing correction are offered. The analysis does not indicate between which groups the difference is found, only that one exists.
Those markers for which a significant difference is found are placed into a new set in the Markers component called "Significant Genes". The results are also displayed as a heat map in the Color Mosaic component.
ANOVA Parameters and Settings
- To use the ANOVA routine, first check that it has been loaded in the Component Configuration Manager.
- ANOVA will be found in the list of loaded analysis routines in the lower-right Commands quadrant of geWorkbench.
P-Value Estimation
F-Distribution
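The p-value for each marker is obtained from the F-distribution. An F-distributed statistic arises as the ratio of two independent variance estimates from normally distributed data, each of which follows a scaled chi-squared distribution; in one-way ANOVA these are the between-group and within-group variances.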
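The F-statistic for a single marker can be sketched as follows. This is a minimal illustration of the textbook one-way ANOVA calculation, not the MEV code itself; the function name `f_statistic` and the list-of-lists input layout are assumptions made for the example, and the case of zero within-group variance is not handled.

```python
from statistics import mean

def f_statistic(groups):
    """Textbook one-way ANOVA F-statistic for one marker (illustrative only).

    groups: one list of expression values per activated array group.
    """
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total number of arrays
    group_means = [mean(g) for g in groups]
    grand_mean = mean(x for g in groups for x in g)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of the individual values around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    # Ratio of the two variance estimates (mean squares).
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

For the three groups `[1, 2, 3]`, `[2, 3, 4]` and `[5, 6, 7]`, the between-group mean square is 13 and the within-group mean square is 1, so the function returns an F-statistic of 13.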
Permutation
This selection requires the number of permutations. The default value is 100.
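The idea behind a permutation-based p-value can be sketched as below. This is a generic illustration under stated assumptions, not the MEV implementation: the names `f_statistic` and `permutation_p_value` are hypothetical, and the p-value is taken as the plain fraction of permutations whose F-statistic is at least as large as the observed one (the actual component may count differently).

```python
import random
from statistics import mean

def f_statistic(groups):
    """Textbook one-way ANOVA F-statistic for one marker."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    means = [mean(g) for g in groups]
    grand = mean(x for g in groups for x in g)
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def permutation_p_value(groups, n_permutations=100, seed=0):
    """Estimate a p-value by shuffling array labels (assumed scheme)."""
    rng = random.Random(seed)            # fixed seed for reproducibility
    observed = f_statistic(groups)
    sizes = [len(g) for g in groups]
    pooled = [x for g in groups for x in g]
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        # Re-split the shuffled values into groups of the original sizes.
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled) >= observed:
            hits += 1
    return hits / n_permutations
```

With the default of 100 permutations, the smallest non-zero p-value this scheme can report is 0.01, which is worth keeping in mind when combining it with a stringent cutoff.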
Multiple testing corrections
The p-value can be used directly as a cutoff for determining significant genes, or it can be corrected.
Just Alpha
No correction is performed.
Standard Bonferroni
The cutoff value (alpha) is divided by the number of tests (genes) before being compared with the calculated p-values.
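As a small sketch of this rule, the function below divides alpha by the number of tests and flags each p-value that falls at or below the resulting cutoff. The name `bonferroni_significant` is hypothetical, and whether the component uses "less than" or "less than or equal" for the comparison is an assumption.

```python
def bonferroni_significant(p_values, alpha=0.01):
    """Flag p-values that pass a standard Bonferroni cutoff (illustrative)."""
    # One shared cutoff for every marker: alpha / number of tests.
    cutoff = alpha / len(p_values)
    return [p <= cutoff for p in p_values]
```

For example, with alpha = 0.01 and four p-values the cutoff is 0.0025, so `bonferroni_significant([0.001, 0.003, 0.02, 0.3])` flags only the first value.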
Adjusted Bonferroni
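Implements the Holm-Bonferroni correction. The P-values are sorted in ascending order; the smallest is compared with alpha divided by the number of tests (genes), and for each successive P-value the divisor is decremented by one before the comparison. In the standard Holm procedure, testing stops at the first P-value that exceeds its cutoff. The effect is to slightly reduce the stringency of the Bonferroni correction.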
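The step-down logic can be sketched as follows. This follows the standard Holm procedure, in which the P-values are taken in ascending order and testing stops at the first one that exceeds its cutoff; it is assumed, not confirmed, that the component behaves the same way. The function name `holm_significant` is hypothetical.

```python
def holm_significant(p_values, alpha=0.01):
    """Flag p-values that pass the Holm-Bonferroni step-down test (sketch)."""
    n = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    significant = [False] * n
    for rank, i in enumerate(order):
        # Divisor starts at n and is decremented by one at each step.
        if p_values[i] <= alpha / (n - rank):
            significant[i] = True
        else:
            break  # all remaining (larger) p-values are also rejected
    return significant
```

With alpha = 0.01, the p-values `[0.3, 0.003, 0.02, 0.001]` give `[False, True, False, True]`: 0.001 passes against 0.01/4 and 0.003 passes against 0.01/3, whereas a standard Bonferroni cutoff of 0.0025 would have kept only 0.001.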
Westfall-Young Step Down
(This correction is only available when the permutation method is chosen for calculating p-values).
False Discovery Control
(This correction is only available when the permutation method is chosen for calculating p-values).
The user must specify the limit on false discoveries as either:
- An upper limit on the number of false positives (markers falsely called as showing a significant difference), or
- An upper limit on the proportion of false positives.
Note on saving settings
The False Discovery Control parameter fields will only have their values saved if they are actually selected. As they are controlled by radio buttons, only one field can be active at a time, and hence at most one of those fields will be saved in any one set.
Services (Grid)
ANOVA can be run either locally within geWorkbench, or remotely as a grid job on caGrid. See the Grid Services section for further details on setting up a grid job.
Viewing ANOVA Results
When an ANOVA result node is selected (highlighted) in the project panel, the results will be displayed in both a tabular form and in the form of a heatmap in the Color Mosaic component.
Color Mosaic
In this view, a color spectrum is used to indicate the relative magnitudes of the measurements. The arrays (columns) are grouped by input group membership, i.e. group 1, group 2, etc. Each row corresponds to a marker, and the markers are displayed in ascending order of p-value (from most significant to least significant). This visualization is described in detail in the Color Mosaic Help and Tutorials.
Tabular View
This Visual Area component displays a read-only spreadsheet view of the significant genes sorted by p-value in ascending order (from most significant to least significant).
Basic Navigation
- Resize columns by using the mouse to drag column headings.
- Reorder columns in the details pane by using a mouse to drag a column heading to the left or right of its original position. As you drag a column, highlighting between the column headings indicates the new position of the column.
- To sort by a specific column, double click on the column header to sort in ascending order, and then click again to sort in descending order.
- Display Preference: Click the Display Preference button to modify the visible columns.
- Export: Click on Export in the lower left of the visualization to export this table in .csv format. The export file will contain only the columns displayed.
Example of running ANOVA
Loading and preparing the example data
This example uses the Bcell-100.zip dataset, which is available from and further described on the Download page. Briefly, this dataset is composed of 100 Affymetrix HG-U95Av2 arrays on which various B-cell lines, both normal and cancerous, were analyzed. Thus it explores a potentially wide variety of expression phenotypes. In order to see the gene names associated with the probesets, one must also download the correct annotation file for this array from the Affymetrix Netaffx website.
- Obtain the Bcell-100.exp dataset. If downloaded as a zip file, unzip it to a convenient directory.
- (Optional) Obtain the annotation file for the HG-U95Av2 array type from Affymetrix. The name will be similar to "HG_U95Av2.na28.annot.csv", where na28 is the version number.
- Load the Bcell-100.exp dataset into geWorkbench as type "Affymetrix File Matrix". (See Loading from File).
- When prompted, and if desired, load the annotation file.
- For this example, the data was subjected to quantile normalization.
- The data was then log2 transformed (under Normalization). This gives the data a more "normal" distribution.
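The two preprocessing steps can be illustrated with the generic sketch below. It shows the usual textbook form of quantile normalization followed by a log2 transform and is not taken from geWorkbench; the function names, the column-per-array layout, and the handling of ties (broken by original position) are all assumptions. The log2 step requires strictly positive values.

```python
import math

def quantile_normalize(columns):
    """Give every array (column) the same value distribution (sketch).

    columns: one list of expression values per array, same marker order.
    """
    n_markers = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [sum(col[k] for col in sorted_cols) / len(columns)
                 for k in range(n_markers)]
    normalized = []
    for col in columns:
        # Marker indices ordered from lowest to highest value in this array.
        ranks = sorted(range(n_markers), key=lambda i: col[i])
        new_col = [0.0] * n_markers
        for rank, i in enumerate(ranks):
            new_col[i] = reference[rank]
        normalized.append(new_col)
    return normalized

def log2_transform(columns):
    """Apply log2 to every value; values must be greater than zero."""
    return [[math.log2(v) for v in col] for col in columns]
```

For two arrays `[5, 2, 3]` and `[4, 1, 6]`, the reference distribution is `[1.5, 3.5, 5.5]`, so the normalized arrays are `[5.5, 1.5, 3.5]` and `[3.5, 1.5, 5.5]`: each array keeps its own marker ranking but takes its values from the shared distribution.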
Choosing array groups
The Bcell-100 dataset comes with predefined sets of arrays. In the Arrays/Phenotypes component (at lower left in the geWorkbench GUI), choose the group in the pulldown menu called "Class". Check the box beside each of the four sets of arrays to activate them as shown in the figure below.
Setting up the Parameters and running ANOVA
For this example we will apply a relatively stringent multiple testing correction.
- Leave the P-value method set to F-distribution.
- Set the P-Value Threshold (alpha) to 0.01.
- For the P-value correction choose Standard Bonferroni.
- Push the Analyze button.
ANOVA result display
When the ANOVA calculation completes, the result node is placed in the Project Folders component. Also shown is a "snapshot" node, which is a static picture of the heat map, labeled "Color Mosaic View". It was produced by right-clicking on the Color Mosaic display (see below) and selecting "Take Snapshot".
Two visualizations are available for ANOVA results: the Color Mosaic and the Tabular View.
Color Mosaic
The Color Mosaic view displays the results as a heat map. Each activated set of arrays is labeled at the top of the figure, and the markers are sorted in order of the p-value. The heat map is colored using the currently selected color scheme (Menu->Tools->Preferences->Visualization). A color bar at the bottom shows the range of the color display and its correlation with expression values. Further details are available in the Color Mosaic tutorial.
Tabular Viewer
The Tabular Viewer shows the ANOVA results in spreadsheet format. The table can be sorted on any column by clicking on its header label.
- Marker Name - Shows the gene name if an annotation file has been loaded, otherwise shows the probeset name.
The Display Preference button brings up a panel which controls which columns to display. The available column choices are:
- F-statistic - the raw ANOVA score for each marker.
- P-value - the probability of observing an F-statistic this large by chance alone, assuming the null hypothesis of no actual differences between sets of arrays.
- Mean - the mean expression value for each set.
- Std - the standard deviation for each set.
Significant markers set
All markers which met the threshold p-value (alpha) cutoff are placed into the "Significant Genes" set in the Markers component. Such sets of markers can be used as the starting point for further characterization and analysis.