Tutorial - Reverse Engineering


Latest revision as of 18:08, 21 November 2006



Note

Please see also the beta-release in geWorkbench of the ARACNE algorithm for network reverse engineering. This is a more complete implementation than is present in the Reverse Engineering component.

Outline

In this tutorial, we will cover:

  1. Background to gene network reverse engineering.
  2. Calculating the mutual information between a hub gene and the others in a dataset.
  3. Calculating the interaction network around the hub gene.
  4. Displaying the results in Cytoscape.
  5. Selecting a subset of the interaction network (first neighbors) and saving it to a list.

Overview

The primary use of the Reverse Engineering component is to infer regulatory interactions between genes and gene products. To find these interactions, the component uses the information-theoretic concept of mutual information. Mutual information is in principle more sensitive and flexible than a simple correlation calculation, because it can also detect dependencies that are not linear. It is also invariant under invertible transformations of the data, such as a log transformation.
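
To make the idea of mutual information concrete, here is a minimal Python sketch. It is not the algorithm used by geWorkbench; the equal-frequency (rank-based) binning estimator, the function names and the toy expression values are all assumptions made for illustration. Because the bins depend only on the rank order of the values, a log transformation of the hub profile leaves the estimate unchanged.

```python
import math
from collections import Counter

def rank_bins(values, n_bins):
    """Assign each value to one of n_bins equal-frequency bins by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins

def mutual_information(x, y, n_bins=4):
    """Estimate the mutual information (in bits) between two profiles.

    Illustrative estimator only; not the geWorkbench implementation.
    """
    bx, by = rank_bins(x, n_bins), rank_bins(y, n_bins)
    n = len(x)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    mi = 0.0
    for (i, j), count in joint.items():
        # p(i,j) * log2( p(i,j) / (p(i) * p(j)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (px[i] * py[j]))
    return mi

# Toy expression profiles over 8 arrays (made-up values).
hub = [1.5, 2.0, 3.1, 4.7, 6.2, 8.8, 12.4, 20.1]
target = [0.9, 2.5, 1.1, 2.9, 5.0, 9.7, 5.5, 11.3]

mi_raw = mutual_information(hub, target)
mi_log = mutual_information([math.log2(v) for v in hub], target)
print(mi_raw, mi_log)
```

The two printed values are identical, since the log transformation does not change the rank order of the hub values.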

The Reverse Engineering component calculates the information that the expression pattern of one gene carries about the expression of another gene; that is, it is a pairwise calculation. Larger datasets, containing more arrays per marker, will yield greater sensitivity and better statistical support. Full-scale runs of reverse engineering algorithms, which compare all markers against each other on datasets containing several hundred microarrays, are typically performed on large cluster computers and are not feasible on a desktop machine. During 2006 we hope to provide a remote service that can host such calculations for jobs launched from geWorkbench. At present, however, smaller-scale calculations are supported directly in geWorkbench.

As typically used in geWorkbench, the Reverse Engineering component first calculates the Mutual Information score between a single hub gene and all other N markers in the dataset. In a second step, a subset containing the best M markers is chosen (with a current limit of 100), and a complete pairwise mutual information calculation is performed between them. This covers roughly MxM/2 pairs, or exactly M(M-1)/2 distinct pairs. The network resulting from this calculation can be displayed as a branched tree of interactions within the Cytoscape component.
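
The size of the two steps can be sketched in Python as follows. The scores here are placeholders rather than real mutual information values, and select_top_markers and pairwise_jobs are hypothetical helper names, not part of geWorkbench.

```python
from itertools import combinations

def select_top_markers(hub_scores, m=100):
    """Return the names of up to m markers with the highest hub score."""
    ranked = sorted(hub_scores, key=hub_scores.get, reverse=True)
    return ranked[:m]

def pairwise_jobs(markers):
    """List every distinct pair of markers for the second step."""
    return list(combinations(markers, 2))

# Step 1: one score per non-hub marker (placeholder values, N = 2225).
hub_scores = {f"marker_{i}": 1.0 / (i + 1) for i in range(2225)}

# Step 2: keep the best M markers and enumerate all pairs among them.
top = select_top_markers(hub_scores, m=100)
pairs = pairwise_jobs(top)
print(len(hub_scores), len(top), len(pairs))
```

With M = 100 the second step involves 100 x 99 / 2 = 4950 pairwise calculations, which is why the subset size is capped.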

Limitations

  1. The exact algorithm used in Reverse Engineering has not been described in the literature. geWorkbench now also includes a beta-release of the published ARACNE algorithm for network reverse engineering.
  2. The "Motif Location Histogram" only displays if the "All Arrays" checkbox is checked. This is being fixed.
  3. The Score vs. Probability graph does not work. This is being fixed.
  4. There is some overlap in functionality between the Profiler tab and the Conditional Analysis tab. Both calculate aspects of conditional interactions.
  5. There is not yet a tutorial for the Conditional Analysis tab.
  6. The images shown below were calculated using a slightly larger dataset than is currently being distributed with the tutorials. The images will be updated soon.

Prerequisites

A dataset containing multiple arrays (the more the better) should be loaded into geWorkbench. If the data is loaded from separate files, it should be merged into a single microarray dataset, either when it is read in or afterwards. See the section Projects and Data Files. In this tutorial we will load a dataset also used in other tutorial sections, which has been normalized and then filtered to reduce the number of genes. This file, "webmatrix_quantile_log2_dev1.2_mv0.exp", is available in the tutorial data section. Its derivation is described in Data.
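
For readers who want to see what this kind of preparation involves, the following Python sketch applies quantile normalization, a log2 transformation and a deviation filter to a tiny made-up matrix. It is an illustration under stated assumptions, not the geWorkbench code: "deviation" is taken here to mean the standard deviation of a marker across arrays, the function names are hypothetical, and markers below the threshold are simply dropped.

```python
import math
import statistics

def quantile_normalize(matrix):
    """Give every array (column) the same distribution of values.

    matrix is a list of marker rows, each holding one value per array.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[c] for row in matrix] for c in range(n_cols)]
    sorted_cols = [sorted(col) for col in columns]
    # Mean of the r-th smallest value across all arrays.
    rank_means = [sum(col[r] for col in sorted_cols) / n_cols
                  for r in range(n_rows)]
    result = [[0.0] * n_cols for _ in range(n_rows)]
    for c, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda r: col[r])
        for rank, r in enumerate(order):
            result[r][c] = rank_means[rank]
    return result

def log2_transform(matrix):
    """Replace every (positive) value by its base-2 logarithm."""
    return [[math.log2(v) for v in row] for row in matrix]

def filter_by_deviation(names, matrix, min_dev):
    """Keep markers whose standard deviation across arrays is >= min_dev."""
    kept = [(name, row) for name, row in zip(names, matrix)
            if statistics.pstdev(row) >= min_dev]
    return [name for name, _ in kept], [row for _, row in kept]

# Three hypothetical markers measured on three arrays.
names = ["a", "b", "c"]
raw = [[100.0, 120.0, 90.0],
       [800.0, 50.0, 3000.0],
       [400.0, 410.0, 390.0]]

normalized = quantile_normalize(raw)
logged = log2_transform(normalized)
kept_names, kept_rows = filter_by_deviation(names, logged, min_dev=1.2)
print(kept_names)
```

The threshold of 1.2 matches the "dev1.2" part of the file name; everything else in the example is invented.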

Example - Profiler

  • Load the data file "webmatrix2_quantile_log2_dev1.2_mv0.exp". This contains a set of 100 experiments on Affymetrix HG_U95Av2 chips. As described above, this file has been quantile normalized, log2 transformed, and then filtered to remove markers with a deviation of less than 1.2, in order to reduce the size of the dataset, leaving 2226 markers.
  • In the upper right section of geWorkbench find the Reverse Engineering component. It should by default be displaying the Profiler tab.
  • In the Markers component search box, on the left side of the geWorkbench interface, enter 1973 and hit enter. This will find the marker 1973_s_at, which is the c-Myc gene, a well-known transcription factor with many interactions. Click on this marker in the list. This will enter the marker into the Hub Gene Label field of the Profiler.


T Markers Search1973.png


  • The default setting in the Profiler is Mutual Information (fast). With this selected, hit Analyze(2D). This will return a list of all markers having an MI score greater than the cutoff value (the default is 0.2).
  • Note - the following images were obtained using a slightly larger dataset, but the results are very similar. New images will be provided soon.

T ReverseEngineering Basic.png


The next step is to create a network. By default, the calculation will be performed for the top 100 scoring genes on the list.

  • If you wish to use a smaller set of genes, select just those you wish to include from the list by highlighting them.
  • By right-clicking and selecting "Add to Set", the selected group will be added to the Markers component as a new set of markers which can be used in other components (sequence retrieval, annotation retriever etc.). If no selection is made, up to the top 100 scoring genes will be added to the new set.
  • Hit the Create Network button. The Mutual Information algorithm will be run again on the selected markers, but this time it will include all pairwise combinations of the selected genes, not just each against the hub gene. Each gene is then connected via an edge with the gene it most strongly interacts with, with the chosen hub-gene at the center.
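
The rule described in the last step, linking each gene to the gene it interacts with most strongly, can be sketched in Python as below. The marker names other than 1973_s_at, the scores and the function names are hypothetical, and the code is only an illustration of the rule, not the geWorkbench implementation.

```python
def strongest_partner_edges(markers, score):
    """Link each marker to the partner with which it has the highest score.

    score maps frozenset({a, b}) to the pairwise score of a and b.
    """
    edges = set()
    for a in markers:
        partners = [b for b in markers if b != a]
        best = max(partners, key=lambda b: score[frozenset((a, b))])
        edges.add(frozenset((a, best)))
    return edges

def adjacency_matrix(markers, edges):
    """Build a symmetric 0/1 matrix with a 1 wherever two markers are linked."""
    index = {m: i for i, m in enumerate(markers)}
    matrix = [[0] * len(markers) for _ in markers]
    for edge in edges:
        a, b = tuple(edge)
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1
    return matrix

markers = ["1973_s_at", "marker_A", "marker_B", "marker_C"]
# Made-up pairwise scores for all six pairs.
score = {
    frozenset(("1973_s_at", "marker_A")): 0.9,
    frozenset(("1973_s_at", "marker_B")): 0.8,
    frozenset(("1973_s_at", "marker_C")): 0.3,
    frozenset(("marker_A", "marker_B")): 0.4,
    frozenset(("marker_A", "marker_C")): 0.2,
    frozenset(("marker_B", "marker_C")): 0.6,
}

edges = strongest_partner_edges(markers, score)
matrix = adjacency_matrix(markers, edges)
for row in matrix:
    print(row)
```

In this toy example marker_C attaches to marker_B rather than to the hub, which is how branches form in the resulting tree.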

After network creation has been run, an adjacency matrix will be placed in the Projects Folder:


T ProjectFolders AdjacencyMatrix.png


  • Select the Cytoscape visualizer if it is not already active. The newly created network will appear similar to that shown here (we have selected the central hub-gene by clicking on it):


T ReverseEngineering InitialNet.png


A better visualization can be created in Cytoscape by going to the Layout menu and choosing yFiles->organic. The layout will now appear as:


T ReverseEngineering Central1973.png


  • Within the network created in Cytoscape, one can select the central gene as already shown above, and then on the Cytoscape menu choose Select->Nodes->First Neighbors of selected nodes.

T ReverseEngineering SelectFirstNeighbors.png


The first neighbors will be highlighted in the graph,


T ReverseEngineering FirstNeighbors.png


and also added as a new set in the Markers component.


T Markers ReverseEng Selected.png
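
What the first-neighbors selection computes can be expressed in a few lines of Python. This is a sketch of the concept only; it does not call Cytoscape, and the edge list and marker names other than 1973_s_at are hypothetical.

```python
def first_neighbors(edges, selected):
    """Return the nodes directly linked to any node in the selection."""
    neighbors = set()
    for a, b in edges:
        if a in selected:
            neighbors.add(b)
        if b in selected:
            neighbors.add(a)
    return neighbors - set(selected)

# A small made-up network around the hub gene.
edges = [
    ("1973_s_at", "marker_A"),
    ("1973_s_at", "marker_B"),
    ("marker_B", "marker_C"),
]

new_set = first_neighbors(edges, {"1973_s_at"})
print(sorted(new_set))
```

Here marker_C is not included, because it is linked to the hub only through marker_B.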


We can return to the main Reverse Engineering component by clicking on the original dataset in the Project Folders component. If we select the first (highest MI score) marker on the list, the graph shown below is drawn in the Motif Location Histogram display. This shows a plot of the expression values on each array for the selected hub marker vs any other marker selected in the list.

  1. The Motif Location Histogram plot is only drawn if you select "All Arrays" or activate a set of arrays in the Arrays/Phenotypes component. If you select "All Arrays", then all the points will be one color:

T ReverseEngineering MotifHistogram.png


  2. If you activate individual sets of arrays, each set will be drawn in a different color.


Motif Location Histogram Sets.png


Options

Pearson - Uses a Pearson correlation function to calculate the interaction scores.
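
As a point of comparison with mutual information, the standard Pearson correlation coefficient can be written in Python as below. This is the textbook formula with made-up profiles; how geWorkbench turns it into an interaction score is not described here, so treat the function as an assumption-laden sketch.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Made-up profiles: a linear relationship and a symmetric non-linear one.
hub = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear = [2.0 * v + 1.0 for v in hub]
squared = [v * v for v in hub]

r_linear = pearson(hub, linear)
r_squared = pearson(hub, squared)
print(r_linear, r_squared)
```

The squared profile depends entirely on the hub values, yet its Pearson correlation with them is zero. This is the kind of dependency that a mutual information score can, in principle, still detect.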