Difference between revisions of "Hierarchical Clustering"

Revision as of 15:56, 29 July 2009




Overview

geWorkbench implements its own code for agglomerative hierarchical clustering. Starting from individual points (the leaves of the tree), nearest neighbors are found for individual points, and then for groups of points, at each step building up a branched structure that converges toward a root that contains all points. The resulting graph tends to group similar items together.
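The bottom-up process described above can be sketched in a few lines. This is a minimal illustration only, not the geWorkbench implementation: the function name `agglomerate`, the use of average linkage with Euclidean distance, and the nested-tuple tree representation are all assumptions made for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def agglomerate(points):
    """Naive agglomerative clustering (hypothetical sketch).

    Leaves of the returned tree are point indices; each internal node is a
    (left, right) tuple. Expects at least one point.
    """
    # Each cluster is (subtree, list of member indices); start with one per point.
    clusters = [(i, [i]) for i in range(len(points))]
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest average pairwise distance.
        for a, b in combinations(range(len(clusters)), 2):
            members_a, members_b = clusters[a][1], clusters[b][1]
            total = sum(dist(points[i], points[j])
                        for i in members_a for j in members_b)
            d = total / (len(members_a) * len(members_b))
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        # Merge the closest pair into a new node one level nearer the root.
        merged = ((clusters[a][0], clusters[b][0]),
                  clusters[a][1] + clusters[b][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0][0]


tree = agglomerate([(0.0,), (0.1,), (5.0,), (5.2,)])
print(tree)  # ((0, 1), (2, 3))
```

The two nearby pairs are joined first, and the final merge forms the root that contains all points. The sketch recomputes every distance at each step, which is simple to read but far slower than a practical implementation.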

Results of hierarchical clustering are displayed in the Dendrogram component, which is further described below.

Prerequisites

Dataset

A microarray dataset must be loaded in the Project Folders component. If an annotation file corresponding to the microarray type is also loaded, gene names are used in the results display; otherwise, probeset names are used.

Note - hierarchical clustering is memory intensive. With the default memory settings (see here to change), clustering more than about 2000 markers is not recommended.


Missing Values

If the dataset contains any missing values, Hierarchical Clustering will return an error message when it is run. Missing values can be filtered out or replaced using the Missing Value Filter or the Missing Value Normalizer.
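The two general strategies, removing incomplete markers or replacing the missing values, can be illustrated as follows. This is a hypothetical sketch and does not reproduce the behavior of the Missing Value Filter or the Missing Value Normalizer: the matrix layout (rows are markers), the use of `None` for a missing value, the row-mean replacement rule, and all function names are assumptions.

```python
def has_missing(matrix):
    """Return True if any expression value is missing (None)."""
    return any(v is None for row in matrix for v in row)


def drop_markers_with_missing(matrix):
    """Filtering: keep only markers (rows) that have no missing values."""
    return [row for row in matrix if all(v is not None for v in row)]


def fill_with_row_mean(matrix):
    """Replacement: substitute each missing value with its marker's mean."""
    filled = []
    for row in matrix:
        present = [v for v in row if v is not None]
        if not present:
            raise ValueError("marker has no observed values to average")
        mean = sum(present) / len(present)
        filled.append([mean if v is None else v for v in row])
    return filled


data = [
    [1.0, None, 3.0],  # marker with one missing value
    [4.0, 5.0, 6.0],
]
print(has_missing(data))                # True
print(drop_markers_with_missing(data))  # [[4.0, 5.0, 6.0]]
print(fill_with_row_mean(data))         # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
```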

Parameters

T HC default settings.png


Clustering Method

Specifies the method used to measure the distance between two clusters as the hierarchical tree is constructed. In each method, the distance from each point in one cluster to each point in a second cluster is calculated, giving a set of pairwise distances. The methods differ in how the overall distance between the clusters is derived from this set:

Single Linkage

The smallest pairwise distance found between the two clusters is used.

This method often leads to a "chaining" effect and is usually not recommended.

Average Linkage

The average of all the pairwise distances between the two clusters is used.

Total Linkage

The largest pairwise distance found between the two clusters is used. This method is also commonly known as complete linkage.
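The three linkage rules differ only in how they summarize the same set of pairwise distances. The following sketch makes that explicit; the function names are chosen for illustration, and Euclidean distance is assumed as the point-to-point measure.

```python
from math import dist  # Euclidean distance, Python 3.8+


def pairwise_distances(cluster_a, cluster_b):
    """Distance from every point in one cluster to every point in the other."""
    return [dist(p, q) for p in cluster_a for q in cluster_b]


def single_linkage(cluster_a, cluster_b):
    """Smallest pairwise distance between the two clusters."""
    return min(pairwise_distances(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    """Mean of all pairwise distances between the two clusters."""
    d = pairwise_distances(cluster_a, cluster_b)
    return sum(d) / len(d)


def total_linkage(cluster_a, cluster_b):
    """Largest pairwise distance between the two clusters."""
    return max(pairwise_distances(cluster_a, cluster_b))


a = [(0.0, 0.0), (1.0, 0.0)]
b = [(4.0, 0.0), (6.0, 0.0)]
print(single_linkage(a, b))   # 3.0
print(average_linkage(a, b))  # 4.5
print(total_linkage(a, b))    # 6.0
```

Because single linkage needs only one close pair of points to join two clusters, a long string of points can be merged one at a time, which is the "chaining" effect noted above.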

Clustering Dimension

Marker

Cluster only the selected markers (genes), based on their similarity across the selected microarrays.

Microarray

Cluster only the selected microarrays, based on their similarity across the selected markers.

Both

Cluster both markers and microarrays.
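In terms of data layout, the choice of dimension determines which vectors are compared. The sketch below assumes a small hypothetical expression matrix in which rows are markers and columns are microarrays; the variable and function names are illustrative only.

```python
# Hypothetical expression matrix: 2 markers (rows) x 3 microarrays (columns).
expression = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]


def marker_vectors(matrix):
    """One vector per marker: its values across all microarrays (rows)."""
    return [list(row) for row in matrix]


def microarray_vectors(matrix):
    """One vector per microarray: its values across all markers (columns)."""
    return [list(column) for column in zip(*matrix)]


print(marker_vectors(expression))      # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print(microarray_vectors(expression))  # [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
```

Clustering by Marker compares the first list of vectors, clustering by Microarray compares the second, and Both performs the two clusterings on the same matrix.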

Clustering Metric

The values being clustered, whether markers or microarrays, can each be represented by vectors of numbers, essentially either rows (markers) or columns (microarrays) taken from a spreadsheet view of all expression values. Several methods by which to calculate the distance between any two vectors are offered:

Euclidean

The direct, point-to-point distance is calculated (square root of the sum of square differences).

Pearson's

Pearson's correlation coefficient for two vectors is calculated.

Spearman's

Spearman's rank correlation coefficient for two vectors is calculated.
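The three measures can be sketched with the standard library alone. The code computes the quantities named above; how geWorkbench converts a correlation coefficient into a distance is not stated here, so no such conversion is shown. Function names are illustrative, and tied values are given average ranks, which is the usual convention but an assumption with respect to geWorkbench.

```python
from math import sqrt


def euclidean(x, y):
    """Square root of the sum of squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson's correlation coefficient (undefined for constant vectors)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson's coefficient of the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 4.0, 9.0, 16.0]  # increases with x, but not linearly
print(euclidean([0.0, 0.0], [3.0, 4.0]))  # 5.0
print(round(pearson(x, y), 3))            # 0.984
print(round(spearman(x, y), 3))           # 1.0
```

The example shows the practical difference between the two correlation measures: Spearman's coefficient depends only on the ordering of the values, so any strictly increasing relationship scores 1, whereas Pearson's coefficient is reduced by the curvature.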

Set Selection

Most geWorkbench analysis components provide the "All Arrays" and "All Markers" check-boxes to allow any activated sets of arrays or markers to be overridden. Normally, activating one or more sets of markers or arrays limits an analysis to those items in the active set(s).

All Arrays

Use all arrays in the dataset.

All Markers

Use all markers in the dataset.

Analysis Actions

This component uses the standard analysis component framework, which provides three buttons:

Analyze

Start the clustering job.

Save Settings

Save the current settings to a named entry in the settings list.

Delete Settings

Delete the selected setting entry from the list.

A separate tutorial will be devoted to these common actions.

Services

Several geWorkbench analysis routines have been implemented as analytical services available on caGrid. On the services tab, the user can select whether to run a job locally or using a grid service. A separate section will be devoted to explaining common aspects of grid services.


The Dendrogram visual component

Hierarchical clustering results are displayed in the Dendrogram component.

T HC Dendrogram display.png

Controls

Enable Selection

Checking this box allows a subtree of the dendrogram to be selected interactively using the mouse. The selected area is highlighted in blue. Clicking on the selected area will restrict the display to just the selected portion of the tree.

Gene Height

Sets the height in pixels of the rows devoted to each marker and associated labels.

Gene Width

Sets the width in pixels of the columns devoted to each array. Label text is also scaled proportionately.

Color key

Shows the range of color values from lowest to highest expression for the current display preference. See the Color Mosaic tutorial for further details on the absolute and relative display preference settings.

Intensity slider

The intensity slider adjusts the midpoint of the color scale to lower or higher expression values.

Bulb Icon

Pushing the bulb icon activates a tool-tip feature on the dendrogram display. Mousing over the dendrogram will bring up a display of the following information for any point:

  • the array name
  • the marker/gene name
  • the expression value of the selected marker on the selected array.

T Color Mosaic Tooltip.png

Left-click actions

Left clicking on any point on the dendrogram will highlight the selected marker in the Markers component.

Right-click menu items

T HC Dendrogram add-to-set.png

Example

Running the calculation

This example starts from the set of markers produced in the ANOVA example. Please follow the steps for that example to produce the starting marker set, or create or select another set of markers of your own.

1. If following the ANOVA example, activate the set of markers labeled "Significant Genes [1786]" (which contains 1786 markers).


T HC set activation.png


2. Set the parameters as shown in the following figure.


T HC setup.png

  • Clustering Method: Average Linkage.
  • Clustering Dimension: Marker.
  • Clustering Metric: Euclidean.

3. Click Analyze.

4. A progress bar will be visible during the calculation, first displaying a message regarding computing distances....


T HC computing.png


... and then about clustering.


T HC Clustering message.png



The results are placed in the Project Folders component and labeled "Hierarchical Clustering", and can be displayed in the Dendrogram component.


Displaying results in the Dendrogram component

The following figure shows a close-up of the resulting dendrogram. The four horizontal bars shown in the diagram were added just to show the boundaries of the four array sets used. (Note - you do not need to activate sets for clustering unless you wish to use only a subset of the available arrays or markers.)


T HC Dendrogram marked.png

Selecting a subtree

The Dendrogram component allows one to select and work with just a portion of the displayed tree. To activate this feature, check the Enable Selection checkbox at lower left in the Dendrogram component.


T HC Dendrogram EnableSelection.png

Now when the cursor is moved over the displayed tree, a blue rectangle will indicate which portion of the tree is currently sub-selected.


T HC Dendrogram selecting.png


Clicking on the selected area will cause only this area to be displayed, as shown below.


T HC Dendrogram selection.png

Working with a subtree in the Dendrogram

Right-clicking on the displayed sub-tree will show a menu with two entries:

  • Image Snapshot - place a static snapshot of the tree as currently displayed into the Project Folders component.
  • Add to Set - add the markers represented in the currently displayed tree to a new set in the Markers component called "Cluster Tree".

T HC Dendrogram add-to-set.png


The following figure shows the new set of markers (labeled "Cluster Tree") after it has been added to the Markers component.


T HC MarkerSets-ClusterTree.png