geWorkbench


Outline

This tutorial contains

an overview of MatrixREDUCE,
descriptions of data types, parameters, and the graphical user interface, and
an example of a simple run of the program.

Overview

MatrixREDUCE is a tool for inferring cis-regulatory elements and transcriptional module activities from microarray data. It attempts to calculate a sequence-specific binding affinity for putative transcription factors. The sequence specificity of the transcription factors' DNA-binding domain is modeled using a position-specific affinity matrix (PSAM), representing the change in the binding affinity (Kd) whenever a specific position within a reference binding sequence is mutated. The resulting PSAM can be displayed as an affinity logo or as a consensus sequence.
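
To make the relationship between a PSAM and its consensus sequence concrete, here is a minimal Python sketch. The matrix layout (one dictionary of base weights per position, with the reference base set to 1.0) and all numeric values are assumptions chosen for illustration; they are not taken from MatrixREDUCE output.

```python
# Hypothetical PSAM: one dict per motif position, mapping base -> relative affinity.
# The preferred base at each position has weight 1.0; the other values are made up.
psam = [
    {"A": 1.0, "C": 0.2, "G": 0.1, "T": 0.05},
    {"A": 0.1, "C": 1.0, "G": 0.3, "T": 0.2},
    {"A": 0.05, "C": 0.1, "G": 1.0, "T": 0.4},
    {"A": 0.3, "C": 0.1, "G": 0.2, "T": 1.0},
]


def consensus(matrix):
    """Return the highest-weight base at each position, joined into a string."""
    return "".join(max(column, key=column.get) for column in matrix)


print(consensus(psam))  # ACGT
```

The consensus keeps only the best base per position, so it discards the relative weights that an affinity logo would show.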

For further details see STV webpage for MatrixREDUCE.

Data Files

MatrixREDUCE operates on two data files plus an optional topological pattern file.

The data files required are a microarray gene expression dataset and a FASTA file containing the DNA sequences corresponding to the regulatory region of genes probed in the microarray experiment. The gene/probe identifiers used in the microarray dataset and the sequence identifiers used in the FASTA file must match. However, case is not important; the program will change the identifiers to lower case before attempting to match the records. If an identifier appears more than once in a file, the last instance is used.
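
The matching rule described above (identifiers lower-cased before comparison, last duplicate wins) can be sketched as follows. This is an illustration only, not the geWorkbench implementation; the function names, the sample identifiers, and the in-memory representation of the expression data are all hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into {lower-cased id: sequence}.

    A repeated identifier resets its entry, so the last instance wins.
    """
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0].lower()
            records[current] = ""  # start (or restart) this record
        elif current is not None:
            records[current] += line
    return records


def read_expression(rows):
    """Map lower-cased probe id -> expression values; the last instance wins."""
    return {probe_id.lower(): values for probe_id, values in rows}


# Made-up sample data: note the mixed case and the duplicated YAL001C entry.
fasta_text = """>YAL001C
ACGTACGT
>yal002w
TTTTAAAA
>YAL001c
GGGGCCCC
"""
expression_rows = [
    ("yal001c", [0.5, -1.2]),
    ("YAL002W", [1.1, 0.3]),
    ("YAL003W", [0.0, 0.2]),  # no matching sequence
]

sequences = read_fasta(fasta_text)
expression = read_expression(expression_rows)
matched = sorted(set(sequences) & set(expression))
print(matched)
```

Only identifiers present in both files after lower-casing end up in `matched`; here the probe without a sequence is dropped, and the second YAL001C sequence replaces the first.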

The topological pattern can either be specified by name or loaded from an external file (via the adjacent "Load" button).

Parameters

(default values are shown in parentheses)

P value - p-value threshold at which to stop looking for new motifs (0.001).

Max Motif - maximum number of motifs to search for.

Strand - [("Auto-detect"), "Leading", "Reverse", "Both"]

Save run log - diagnostic messages from the run will be saved to a file.

Graphical Interface for results

PSAM Detail Tab

The PSAM detail tab displays the result PSAMs in a table format. Users can modify the display so that each PSAM is represented either by its sequence logo or its consensus sequence. Selected or all PSAMs can be exported to a text file using the "Export" or "Export All" buttons. Explanation of result table columns:

F - a composite constant determined while fitting the models to the data. See eq. 12, p. e143 of Foat, Morozov and Bussemaker (2006).

t - a Pearson correlation value measuring the goodness of fit of the PSAM. (It is not clear whether this is the t-value for the optimal PSAM or the average t-value mentioned in the description quoted below.)

P - P-value corresponding to the average t-value (does not correspond to the optimal PSAM).

The parameters "t" and "P" shown in the output summary table are further described in Foat et al. (2006) as follows:

"MatrixREDUCE uses a k-fold cross-validation to determine the significance of each discovered PSAM. After converging on a PSAM, the input data is split into k random subsets of array features with associated sequences. The optimal PSAM is then used to seed each of k re-optimizations of the PSAM. A t-value (Pearson correlation) for the goodness of fit is calculated for the optimal PSAM of each subset. Finally, the P-value corresponding to the average t-value for the k re-optimizations is used to test whether the originally optimized PSAM should be kept. This procedure does not test the significance of the optimal PSAM itself, but rather it tests whether the data contains widely distributed, explainable variance."

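The averaging step in the quoted procedure can be sketched in Python as below. This is a simplified illustration under stated assumptions: it splits the data into k random subsets and averages the per-subset Pearson correlation between predicted and observed values, but it omits the PSAM re-optimization on each subset and the conversion of the average to a P-value. The function names and the toy data are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_fold_correlation(predicted, observed, k, seed=0):
    """Split the indices into k random subsets and average the per-subset correlation.

    In the real procedure the PSAM would be re-optimized on each subset before
    the correlation is computed; that step is left out of this sketch.
    """
    indices = list(range(len(observed)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    correlations = [
        pearson([predicted[i] for i in fold], [observed[i] for i in fold])
        for fold in folds
    ]
    return sum(correlations) / k


# Toy data: the observed values are an exact linear function of the predictions,
# so every subset has a correlation of (approximately) 1.
predicted = [0.1 * i for i in range(20)]
observed = [2.0 * p + 1.0 for p in predicted]
average_r = average_fold_correlation(predicted, observed, k=5)
print(round(average_r, 6))
```

Because the statistic is an average over random subsets, it reflects how consistently the model explains the data across subsets rather than how well the single optimal PSAM fits, which is the distinction the quotation draws.
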
Sequence Tab

The sequence tab displays the DNA sequences of the regulatory region associated with each gene that is probed. Users can visualize the matching scores of PSAMs against each sequence. Sequence score is the product of weights w across all positions in the sliding window.
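
The sliding-window score described above can be illustrated with a short Python sketch. The PSAM layout and weights are hypothetical, and the handling of characters outside A/C/G/T (weight 0) is an assumption of this sketch, not documented MatrixREDUCE behavior.

```python
# Hypothetical two-position PSAM; weights are made up and lie between 0 and 1.
psam = [
    {"A": 1.0, "C": 0.5, "G": 0.25, "T": 0.125},
    {"A": 0.25, "C": 1.0, "G": 0.5, "T": 0.125},
]


def window_scores(sequence, matrix):
    """Score every window of len(matrix) bases as the product of per-position weights."""
    width = len(matrix)
    scores = []
    for start in range(len(sequence) - width + 1):
        score = 1.0
        for offset, column in enumerate(matrix):
            base = sequence[start + offset].upper()
            score *= column.get(base, 0.0)  # assumption: unknown bases score 0
        scores.append(score)
    return scores


scores = window_scores("ACGA", psam)
print(scores)  # [1.0, 0.25, 0.0625]
```

With all weights between 0 and 1, each window score also stays between 0 and 1, and only a window matching the preferred base at every position reaches 1.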

When a PSAM is selected, its weighted score is displayed graphically on the input (upstream) sequences (Affinity score graph). The system computes for every position an aggregate affinity score for all selected PSAMs and plots the scores along all sequence positions. Each score is between 0 and 1. Only scores larger than the designated cut-off threshold are drawn.

Example

In this example, we will use two files that are included in the data directory of the geWorkbench distribution itself: SpellmanReduced.txt and Y5_600_Bst.fa. SpellmanReduced.txt contains a subset of the data from Spellman et al. (1998). Y5_600_Bst.fa contains the corresponding upstream DNA sequences for these genes.

1. In the Project Folders component, either use an existing Project, or create a new one.

2. Right-click on Project and select "Open File(s)".

3. Browse to the file SpellmanReduced.txt and set the file type to "Tab-Delimited". This file is found in the data directory of the geWorkbench installation. Open the file.

4. You will be asked for an annotations file. This is not needed for this example, so you can hit Cancel.

5. Go to the Analysis tab and select Matrix Reduce.

6. To load the sequence file, click the "Load..." button. Browse to the sequence file "Y5_600_Bst.fa" and open it.

7. Click Analyze to run MatrixREDUCE. (If you are running geWorkbench from a console window using ANT, you can follow the progress of the calculations there.)

8. The result is placed as a node beneath the parent microarray dataset in the Project Folders component. At the same time, the results are displayed in the Visual Area of geWorkbench.

There are two tabs in the viewer, PSAM Detail and Sequence. Within PSAM Detail there are two options. The first is the Image view, which depicts the PSAM graphically.

The second viewing option is the Name view, which just shows the consensus sequence without the weighted components.

Finally, the Sequence tab depicts scores along each sequence.

References

Foat BC et al. (2005). Profiling condition-specific, genome-wide regulation of mRNA stability in yeast. PNAS 102(49):17675-17680.

Foat BC, Morozov AV, Bussemaker HJ. (2006). Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics 22(14):e141-e149.

Spellman et al. (1998). Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. Molecular Biology of the Cell 9:3273-3297.
