MatrixREDUCE
Revision as of 17:21, 29 April 2009
Outline
This tutorial contains
- an overview of MatrixREDUCE,
- descriptions of data types, parameters, and the graphical user interface, and
- an example of a simple run of the program.
Overview
MatrixREDUCE is a tool for inferring cis-regulatory elements and transcriptional module activities from microarray data. It attempts to calculate a sequence-specific binding affinity for putative transcription factors. The sequence specificity of the transcription factors' DNA-binding domain is modeled using a position-specific affinity matrix (PSAM), representing the change in the binding affinity (Kd) whenever a specific position within a reference binding sequence is mutated. The resulting PSAM can be displayed as an affinity logo or as a consensus sequence.
For further details see MatrixREDUCE.
Data Files
MatrixREDUCE operates on two data files plus an optional topological pattern file.
The data files required are a microarray gene expression dataset and a FASTA file containing the DNA sequences corresponding to the regulatory region of genes probed in the microarray experiment. The gene/probe identifiers used in the microarray dataset and the sequence identifiers used in the FASTA file must match. However, case is not important; the program will change the identifiers to lower case before attempting to match the records. If an identifier appears more than once in a file, the last instance is used.
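The matching rule above (case-insensitive identifiers, last duplicate wins) can be illustrated with a short Python sketch. This is not geWorkbench code; the function names and the simplified FASTA parsing (identifier = first token after ">") are assumptions made for illustration only.

```python
def read_fasta_ids(lines):
    """Return {lowercased id: sequence}; a later duplicate replaces an earlier one."""
    records = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the identifier is the first token of the header line.
            tokens = line[1:].split()
            current = tokens[0].lower() if tokens else ""
            records[current] = ""  # reset, so the last instance is the one kept
        elif current is not None:
            records[current] += line
    return records


def match_records(expression, sequences):
    """Pair expression rows with sequences on lowercased identifiers.

    expression: iterable of (identifier, values) pairs.
    sequences:  dict as returned by read_fasta_ids.
    """
    expr = {}
    for gene_id, values in expression:
        expr[gene_id.lower()] = values  # last instance wins here as well
    shared = sorted(expr.keys() & sequences.keys())
    return {gene_id: (expr[gene_id], sequences[gene_id]) for gene_id in shared}
```

Identifiers that appear in only one of the two files are simply left out of the result in this sketch; how the program itself reports such records is not described here.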
The topological pattern can either be specified by name or loaded from an external file (via the adjacent "Load" button).
Parameters
(default values are shown in parentheses)
P value - maximum p-value to accept a new motif (0.001).
Max Motif - maximum number of motifs to search for.
Strand - [("Auto-detect"), "Leading", "Reverse", "Both"]
Save run log - diagnostic messages from the run will be saved to a file.
Graphical Interface for results
PSAM Detail Tab
The PSAM detail tab displays the result PSAMs in a table format. Users can modify the display so that each PSAM is represented either by its sequence logo or its consensus sequence. Selected or all PSAMs can be exported to a text file using the "Export" or "Export All" buttons.
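The two display modes can be related to the underlying matrix with a small Python sketch. It assumes that a PSAM is stored as one dictionary of relative weights per position, with the best base at weight 1, and that each weight w corresponds to a free energy difference of -ln(w) in units of RT. The function names and the rule "consensus = highest-weight base" are illustrative assumptions, not the geWorkbench implementation.

```python
import math


def logo_heights(psam):
    """Letter heights for an affinity-logo-style display.

    psam: list of {base: weight} dicts with strictly positive weights.
    Returns, per position, each base's offset from the position average in
    units of RT; positive means more favorable than average.
    """
    heights = []
    for column in psam:
        # Assumption: ddG / RT = -ln(w) for each base.
        ddg = {base: -math.log(w) for base, w in column.items()}
        mean_ddg = sum(ddg.values()) / len(ddg)
        heights.append({base: mean_ddg - value for base, value in ddg.items()})
    return heights


def consensus(psam):
    """Consensus string: the highest-weight base at each position (assumed rule)."""
    return "".join(max(column, key=column.get) for column in psam)
```

Under these assumptions, a position whose weights are all equal produces letters of zero height, which matches the idea that such a position contributes nothing to sequence specificity.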
Explanation of result table columns:
Consensus Sequence - This displays the PSAM resulting from the analysis as an affinity logo. The logo displays "the actual relative free energies of binding for each nucleotide at each position". The horizontal line indicates the average delta-delta-G at each position; each letter is placed above or below the line depending on whether its delta-delta-G is more or less favorable than the average. "The height of the letter can be interpreted as free energy difference from the average in units of RT." The tallest letters thus contribute most to the sequence specificity of the motif.
Experiment Name - explanation needed.
Seed sequence - (Is this a consensus sequence for the seed matrix generated by the algorithm?)
F - F is a composite constant determined while fitting the models to the data. See eq. 12, p. e143 of Foat et al. (2006).
t - a Pearson correlation value on the goodness of fit of the displayed optimal PSAM (need more detail).
P - P-value corresponding to the significance of the t-value for the optimal PSAM vs average t-value of PSAMs obtained by resampling subsets of the data (Is this right?).
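The resampling idea behind the t and P columns can be sketched in Python as follows. This is only an outline under stated assumptions: it splits the data into k random subsets and computes a Pearson correlation between predicted and observed values on each subset, but it does not re-optimize the PSAM on each subset and does not compute a P-value. All function names are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def k_fold_subsets(n_items, k, seed=0):
    """Split the indices 0..n_items-1 into k random, disjoint subsets."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def mean_subset_correlation(predicted, observed, k, seed=0):
    """Average the per-subset Pearson correlations.

    Simplification: the same predictions are reused for every subset,
    whereas the real procedure would refit the PSAM on each subset first.
    """
    scores = []
    for subset in k_fold_subsets(len(observed), k, seed):
        scores.append(
            pearson([predicted[i] for i in subset], [observed[i] for i in subset])
        )
    return sum(scores) / len(scores)
```

The fixed `seed` only makes the random split reproducible in this sketch.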
The parameters "t" and "P" shown in the output summary table are further described in Foat et al. (2006), p. e144, as follows:
"MatrixREDUCE uses a k-fold cross-validation to determine the significance of each discovered PSAM. After converging on a PSAM, the input data is split into k random subsets of array features with associated sequences. The optimal PSAM is then used to seed each of k re-optimizations of the PSAM. A t-value (Pearson correlation) for the goodness of fit is calculated for the optimal PSAM of each subset. Finally, the P-value corresponding to the average t-value for the k re-optimizations is used to test whether the originally optimized PSAM should be kept. This procedure does not test the significance of the optimal PSAM itself, but rather it tests whether the data contains widely distributed, explainable variance".
Sequence Tab
The sequence tab displays the DNA sequences of the regulatory region associated with each gene that is probed. Users can visualize the matching scores of PSAMs against each sequence. The sequence score is the product of the weights w across all positions in the sliding window.
When a PSAM is selected, its weighted score is displayed graphically on the input (upstream) sequences (Affinity score graph). The system computes for every position an aggregate affinity score for all selected PSAMs and plots the scores along all sequence positions. Each score is between 0 and 1. Only scores larger than the designated cut-off threshold are drawn.
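The sliding-window score and the cut-off described above can be sketched in Python for a single PSAM. The sketch assumes the PSAM is a list of per-position weight dictionaries with a maximum weight of 1, so every window score falls between 0 and 1. Treating bases that are missing from the matrix (for example "N") as weight 0, scoring one strand only, and the function names themselves are assumptions for illustration. How scores from several selected PSAMs are combined is not shown.

```python
def window_scores(sequence, psam):
    """Score every window of len(psam) bases as the product of position weights."""
    width = len(psam)
    seq = sequence.upper()
    scores = []
    for start in range(len(seq) - width + 1):
        score = 1.0
        for offset, column in enumerate(psam):
            # Assumption: a base absent from the matrix contributes weight 0.
            score *= column.get(seq[start + offset], 0.0)
        scores.append(score)
    return scores


def above_cutoff(scores, cutoff):
    """Keep only (position, score) pairs whose score exceeds the cut-off."""
    return [(pos, score) for pos, score in enumerate(scores) if score > cutoff]
```

A sequence shorter than the PSAM yields an empty score list in this sketch.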
Example
In this example, we will use two files that are included in the data directory of the geWorkbench distribution itself. They are SpellmanReduced.txt and Y5_600_Bst.fa. SpellmanReduced.txt contains a subset of the data from Spellman 1998. Y5_600_Bst.fa contains the corresponding upstream DNA sequences for these genes.
1. In the Project Folders component, either use an existing Project, or create a new one.
2. Right-click on Project and select "Open File(s)".
3. Browse to the file SpellmanReduced.txt and set the file type to "Tab-Delimited". This file is found in the data directory of the geWorkbench installation. Open the file.
4. You will be asked for an annotations file. This is not needed for this example, so you can hit Cancel.
5. Go to the Analysis tab and select Matrix Reduce.
6. To load the sequence file, click the "Load..." button. Browse to the sequence file "Y5_600_Bst.fa" and open it.
7. Click Analyze to run MatrixREDUCE. (If you are running geWorkbench from a console window using ANT, you can follow the progress of the calculations there.)
8. The result is placed as a node beneath the parent microarray dataset in the Project Folders component. At the same time, the results are displayed in the Visual Area of geWorkbench.
There are two tabs in the viewer, PSAM Detail and Sequence. Within PSAM Detail there are two options. The first is the Image view, which depicts the PSAM graphically.
The second viewing option is the Name view, which just shows the consensus sequence without the weighted components.
Finally, the Sequence tab depicts scores along each sequence.
References
Foat BC et al. (2005). Profiling condition-specific, genome-wide regulation of mRNA stability in yeast. PNAS 102(49):17675-17680.
Foat BC, Morozov AV, Bussemaker HJ (2006). Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics 22(14):e141-e149.
Spellman et al. (1998). Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. Molecular Biology of the Cell 9:3273-3297.




