Difference between revisions of "Hierarchical Clustering"

(Services)
(Overview)
 
(48 intermediate revisions by 2 users not shown)
Line 3: Line 3:
  
 
==Overview==
 
==Overview==
 +
For the '''geWorkbench web''' version of Hierarchical Clustering please see [[Hierarchical_Clustering_web]].
 +
 +
 +
Hierarchical clustering is a method to group arrays and/or markers together based on similarity of their expression profiles.
  
 
geWorkbench implements its own code for agglomerative hierarchical clustering.  Starting from individual points (the leaves of the tree), nearest neighbors are found for individual points, and then for groups of points, at each step building up a branched structure that converges toward a root that contains all points.  The resulting graph tends to group similar items together.
 
geWorkbench implements its own code for agglomerative hierarchical clustering.  Starting from individual points (the leaves of the tree), nearest neighbors are found for individual points, and then for groups of points, at each step building up a branched structure that converges toward a root that contains all points.  The resulting graph tends to group similar items together.
Line 11: Line 15:
 
===Dataset===
 
===Dataset===
  
A microarray dataset must be loaded in the Project Folders component.  If an annotation file is also loaded corresponding to the microarray type, then gene names will be used in the results display, otherwise probeset names will be used.
+
A microarray dataset must be loaded in the [[Workspace]].  If an annotation file is also loaded corresponding to the microarray type, then gene names will be used in the results display, otherwise probeset names will be used.
  
 
Note - hierarchical clustering is memory intensive.  With the default memory settings  (see [[FAQ#Q._How_do_I_increase_the_amount_of_memory_available_to_Java_to_run_geWorkbench.3F | here]] to change), clustering more than about 2000 markers is not recommended.
 
Note - hierarchical clustering is memory intensive.  With the default memory settings  (see [[FAQ#Q._How_do_I_increase_the_amount_of_memory_available_to_Java_to_run_geWorkbench.3F | here]] to change), clustering more than about 2000 markers is not recommended.
  
 +
If more than 1000 markers or 1000 arrays are selected for clustering, a popup warning will be issued.
 +
 +
 +
[[Image:HierarchicalClustering_set_too_large.png]]
 +
 +
 +
The actual number of markers or arrays that can be clustered depends on the amount of memory allocated to the Java virtual machine on your computer.
  
  
Line 22: Line 33:
 
==Parameters==
 
==Parameters==
  
[[Image:T_HC_default_settings.png]]
+
[[Image:HC_default_settings.png]]
  
  
  
 
===Clustering Method===
 
===Clustering Method===
Specifies the method used to measure the distance between two clusters as the hierarchical tree is constructed.  In each, the distance from each point in one cluster to each point in a second cluster is calculated, giving a set of pairwise distances. The methods differ in how the overall distance between the clusters is measured:
+
This parameter is used to indicate the convention used for determining cluster-to-cluster distances when constructing the hierarchical tree. Available options are:  
  
====Single Linkage====
+
* '''Single Linkage''' - The distances are measured between each member of one cluster and each member of the other cluster. The minimum of these distances is considered the cluster-to-cluster distance. This method often leads to a "chaining" effect and is usually not recommended.
The smallest pairwise distance found between the two clusters is used.
 
  
This method often leads to a "chaining" effect and is usually not recommended.
+
* '''Average Linkage''' - The average distance of each member of one cluster to each member of the other cluster is used as a measure of cluster-to-cluster distance.
====Average Linkage====
 
The average of all the pairwise distances between the two clusters is used.  
 
  
====Total Linkage====
+
* '''Total Linkage''' - The distances are measured between each member of one cluster and each member of the other cluster. The maximum of these distances is considered the cluster-to-cluster distance.
The largest pairwise distance found between the two clusters is used.  
 
  
 
===Clustering Dimension===
 
===Clustering Dimension===
====Marker====
+
These are used to indicate whether to cluster markers, microarrays, or both.
Cluster the selected markers (genes) only based on the similarity across selected microarrays.
 
  
====Microarray====
+
* '''Marker''' - Cluster the selected markers (genes) based on the similarity across microarrays.
Cluster the selected microarrays only based on the similarity across selected markers.
+
* '''Microarray''' - Cluster the selected microarrays based on the similarity across markers.
 
+
* '''Both''' - Cluster both markers and microarrays.
====Both====
 
Cluster both markers and microarrays.
 
  
 
===Clustering Metric===
 
===Clustering Metric===
 
The values being clustered, whether markers or microarrays, can each be represented by vectors of numbers, essentially either rows (markers) or columns (microarrays) taken from a spreadsheet view of all expression values.  Several methods by which to calculate the distance between any two vectors are offered:
 
The values being clustered, whether markers or microarrays, can each be represented by vectors of numbers, essentially either rows (markers) or columns (microarrays) taken from a spreadsheet view of all expression values.  Several methods by which to calculate the distance between any two vectors are offered:
  
====Euclidean====
+
* '''Euclidean''' - The direct, point-to-point distance is calculated (square root of the sum of square differences).
The direct, point-to-point distance is calculated (square root of the sum of square differences).
 
  
====Pearson's====
+
* '''Pearson's''' - Pearson's correlation coefficient for two vectors is calculated.
Pearson's correlation coefficient for two vectors is calculated.
 
  
====Spearman's====
+
* '''Spearman's''' - Spearman's rank correlation coefficient for two vectors is calculated.
Spearman's rank correlation coefficient for two vectors is calculated.
 
  
 
===Set Selection===
 
===Set Selection===
Most geWorkbench analysis components provide the "All Arrays" and "All Markers" check-boxes to allow any activated sets of arrays or markers to be overridden.  Normally, activating one or more sets of markers or arrays limits an analysis to those items in the active set(s).
+
Activating one or more sets of markers or arrays in the [[Marker_Sets]] or [[Array_Sets]] components limits an analysis to those items in the active set(s).
 
 
====All Arrays====
 
Use all arrays in the dataset.
 
 
 
====All Markers====
 
Use all markers in the dataset.
 
  
 
==Analysis Actions==
 
==Analysis Actions==
This component uses the standard analysis component framework, which provides three buttons:
+
This component uses the standard [[Analysis_Framework|analysis component framework]], which provides three buttons:
  
===Analyze===
+
* '''Analyze''' - Start the clustering job.
Start the clustering job.
+
* '''Save Settings''' - Save the current settings to a named entry in the settings list.
===Save Settings===
+
* '''Delete Settings''' - Delete the selected setting entry from the list.
Save the current settings to a named entry in the settings list.
 
===Delete Settings===
 
Delete the selected setting entry from the list.
 
 
 
A separate tutorial will be devoted to these common actions.
 
  
 
==Services (Grid)==
 
==Services (Grid)==
Line 86: Line 76:
 
Hierarchical Clustering can be run either locally within geWorkbench, or remotely as a grid job on caGrid.  See the [[Tutorial_-_Grid_Services | Grid Services]] section for further details on setting up a grid job.
 
Hierarchical Clustering can be run either locally within geWorkbench, or remotely as a grid job on caGrid.  See the [[Tutorial_-_Grid_Services | Grid Services]] section for further details on setting up a grid job.
  
==Example==
+
==Running Hierarchical Clustering==
===Running the calculation===
+
This example clusters a set of markers generated through a t-test.  You can generate or select any set of markers to run your own example.
  
This example will take off with the set of markers produced in the [[Tutorial_-_ANOVA | ANOVA]] example. Please follow the steps for that example to produce the starting marker set, or just create/select another set of markers of your own.
+
1. Activate a set of markers.  Here, "Significant Genes [437]" (which contains 437 markers) was activated.
  
1. If following the ANOVA example, activate the set of markers labeled "Significant Genes [1786]" (which contains 1786 markers).
 
  
 +
[[Image:HC_Marker_Set_Activation.png]]
  
[[Image:T_HC_set_activation.png]]
 
  
 +
2. Set the parameters as desired.  Here, we used:
  
2. Set the parameters as shown in the following figure.
 
  
 
+
[[Image:HC_setup.png]]
[[Image:T_HC_setup.png]]
 
  
 
* Clustering methods: Average Linkage.
 
* Clustering methods: Average Linkage.
* Clustering Dimension: Marker.
+
* Clustering Dimension: Both.
* Clustering Metric: Euclidean.
+
* Clustering Metric: Pearson's Correlation.
  
 
3. Click '''Analyze'''.
 
3. Click '''Analyze'''.
Line 120: Line 108:
  
  
 +
The results are placed in the [[Workspace]] and labeled "Hierarchical Clustering", and are displayed in the Dendrogram component.
  
 
+
==The Dendrogram Visual Component==
 
 
The results are placed in the Project Folders component and labeled "Hierarchical Clustering", and are displayed in the Dendrogram component.
 
 
 
==The Dendrogram visual component==
 
 
Hierarchical clustering results are displayed in the Dendrogram component.   
 
Hierarchical clustering results are displayed in the Dendrogram component.   
  
[[Image:T_HC_Dendrogram_display.png]]
+
[[Image:HierarchicalClustering_Dendrogram.png|{{ImageMaxWidth}}]]
  
 
===Controls===
 
===Controls===
Line 153: Line 138:
 
* Signal: the expression value of the selected marker on the selected array.
 
* Signal: the expression value of the selected marker on the selected array.
  
[[Image:T_Color_Mosaic_Tooltip.png]]
+
[[Image:HC_Dendrogram_Tooltip.png]]
 
 
===Left-click actions===
 
Left clicking on any point on the dendrogram will highlight the selected marker in the Markers component. 
 
  
 
===Right-click menu items===
 
===Right-click menu items===
Line 162: Line 144:
 
Right-clicking on the dendrogram will show a menu with two entries:
 
Right-clicking on the dendrogram will show a menu with two entries:
  
* Image Snapshot - place a static snapshot of the tree as currently displayed into Project Folders component.
+
* '''Image Snapshot''' - place a static snapshot of the tree as currently displayed into [[Workspace]].
  
* Add to Set - add the markers represented in the currently displayed tree to a new set in the Markers component called "Cluster Tree".  This is most useful if done after a subtree of markers has been selected.
+
* '''Add to Set''' - add the markers represented in the currently displayed tree to a new set in the Markers component called "Cluster Tree".  This is most useful if done after a subtree of markers has been selected.
  
[[Image:T_HC_Dendrogram_add-to-set.png]]
+
[[Image:HC_Dendrogram_add_to_set.png]]
  
==Example - Displaying hierarchical clustering results==
+
==Working with Hierarchical Clustering Results==
  
The following figure shows a close-up of the dendrogram produced by the above hierarchical clustering example.  The four horizontal bars shown in the diagram were added just to show the boundaries of the four array sets used.  (Note - you do not need to activate sets for clustering, unless you wish to use only a subset of all available arrays or markers).
+
===Selecting a subtree===
 +
The Dendrogram component allows one to select and work with just a portion of the displayed tree.  To activate this feature, check the '''Enable Selection''' checkbox at lower left in the Dendrogram component (indicated by the red arrow).  The subtree selection will work for both markers and arrays, depending only on whether they were included in the initial clustering calculation.
  
 +
The following figure illustrates selecting a subtree of markers.  Moving the cursor over the displayed tree draws a blue rectangle over the selected portion.
  
[[Image:T_HC_Dendrogram_marked.png]]
 
  
===Selecting a subtree===
+
[[Image:HierarchicalClustering_Dendrogram_select_markers.png|{{ImageMaxWidth}}]]
The Dendrogram component allows one to select and work with just a portion of the displayed tree.  To activate this feature, check the '''Enable Selection''' checkbox at lower left in the Dendrogram component.  The subtree selection will work for both markers and arrays, depending only on whether they were included in the initial clustering calculation.  That is, one can only subselect on arrays if the clustering dimension was either "Arrays" or "Both".
 
  
  
[[Image:T_HC_Dendrogram_EnableSelection.png]]
 
  
The following figure illustrates selecting a subtree of markers.  Moving the cursor over the displayed tree draws a blue rectangle over the selected portion.
 
  
 +
Clicking on the selected area will cause only this area to be displayed, as shown below.  A subset of arrays can also be selected, again shown in blue highlighting.
  
[[Image:T_HC_Dendrogram_selecting.png]]
 
  
 +
[[Image:HC_Dendrogram_select_arrays.png|{{ImageMaxWidth}}]]
  
Clicking on the selected area will cause only this area to be displayed, as shown below.
 
  
 +
The result of the second selection is this small set of data points:
  
[[Image:T_HC_Dendrogram_selection.png]]
 
  
 +
[[Image:HC_markers_arrays_selected.png|{{ImageMaxWidth}}]]
  
This figure from a separate example, where both arrays and markers were clustered, shows a subtree of arrays being selected:
 
  
[[Image:T_HC_Dendrogram_selecting_arrays.png]]
 
  
 
===Working with a subtree in the Dendrogram===
 
===Working with a subtree in the Dendrogram===
 
As already mentioned, the right-click menu allows one to save the markers in a displayed subtree to the Markers component, or the arrays in a displayed array subtree to the Arrays/Phenotypes component:
 
As already mentioned, the right-click menu allows one to save the markers in a displayed subtree to the Markers component, or the arrays in a displayed array subtree to the Arrays/Phenotypes component:
  
* Add to Set - add the markers represented in the currently displayed tree to a new set in the Markers or Array component called "Cluster Tree".
+
* '''Add to Set''' - add the markers and arrays represented in the currently displayed tree to new sets in the Markers and Arrays components called "Cluster Tree".
 +
 
 +
 
 +
[[Image:HC_Dendrogram_add_to_set.png]]
  
  
[[Image:T_HC_Dendrogram_add-to-set.png]]
+
The following figure shows the new set of markers (labeled "Cluster Tree [105]") after it has been added to the Markers component.  The number in brackets indicates how many markers are contained in the set.
  
  
The following figure shows the new set of markers (labeled "Cluster Tree [17]") after it has been added to the Markers component.  The number in brackets indicates how many markers are contained in the set.
+
[[Image:HC_Markers_ClusterTree.png]]
  
  
[[Image:T_HC_MarkerSets-ClusterTree.png]]
+
Arrays are added to the Arrays component (labeled "Cluster Tree [10]"), as shown here:
  
The same can be done with a set of arrays, adding them to the Arrays component, as shown here:
 
  
[[Image:T_HC_ArraySets-ClusterTree.png]]
+
[[Image:HC_Array_ClusterTree.png]]

Latest revision as of 16:39, 19 March 2015




Overview

For the geWorkbench web version of Hierarchical Clustering please see Hierarchical_Clustering_web.


Hierarchical clustering is a method to group arrays and/or markers together based on similarity of their expression profiles.

geWorkbench implements its own code for agglomerative hierarchical clustering. Starting from individual points (the leaves of the tree), nearest neighbors are found for individual points, and then for groups of points, at each step building up a branched structure that converges toward a root that contains all points. The resulting graph tends to group similar items together.
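
The bottom-up procedure described above can be sketched in a few lines of Python. This is a minimal illustration only, not geWorkbench's own code: the function name `agglomerate`, the use of Euclidean distance, and the choice of average linkage are all assumptions made for the example.

```python
from itertools import combinations
from math import dist


def agglomerate(points):
    """Naive agglomerative clustering with average linkage (illustrative only).

    Returns the merge history as a list of (members_a, members_b, distance).
    """
    # Each cluster is a tuple of indices into `points`; start with one leaf per point.
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Average of all pairwise Euclidean distances between the two clusters.
            d = sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        # Replace the two nearest clusters with their union, one level closer to the root.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
for a, b, d in agglomerate(points):
    print(a, b, round(d, 3))
```

The two tight pairs are merged first, and the final merge joins them into a single root containing all four points.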

Results of hierarchical clustering are displayed in the Dendrogram component, which is further described below.

Prerequisites

Dataset

A microarray dataset must be loaded in the Workspace. If an annotation file corresponding to the microarray type is also loaded, gene names will be used in the results display; otherwise, probeset names will be used.

Note - hierarchical clustering is memory intensive. With the default memory settings (see the FAQ entry on increasing the amount of memory available to Java to change them), clustering more than about 2000 markers is not recommended.

If more than 1000 markers or 1000 arrays are selected for clustering, a popup warning will be issued.


HierarchicalClustering set too large.png


The actual number of markers or arrays that can be clustered depends on the amount of memory allocated to the Java virtual machine on your computer.
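
To give a rough feel for why the limit depends on memory, the sketch below counts the bytes needed to hold one distance for every unordered pair of items. It assumes 8-byte floating point values and a single pairwise distance table; how geWorkbench actually stores its intermediate data is not described here, so real memory use may differ considerably. The function name `distance_matrix_bytes` is hypothetical.

```python
def distance_matrix_bytes(n_items, bytes_per_value=8):
    """Bytes needed to hold one distance per unordered pair of items (assumed layout)."""
    return n_items * (n_items - 1) // 2 * bytes_per_value


for n in (1000, 2000, 10000):
    print(n, "items:", round(distance_matrix_bytes(n) / 2**20, 1), "MiB")
```

The point of the estimate is the quadratic growth: doubling the number of markers or arrays roughly quadruples the number of pairwise distances.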


Missing Values

If the dataset contains any missing values, running Hierarchical Clustering will return an error message. Missing values can be filtered out or replaced using the Missing Value Filter or the Missing Value Normalizer.
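
Outside geWorkbench, a quick check for missing values in an exported expression table might look like the following sketch. It assumes the data is held as a list of rows and that missing entries appear as `None` or NaN; the function name `find_missing` is hypothetical.

```python
import math


def find_missing(matrix):
    """Return (row, column) positions whose value is None or NaN."""
    return [
        (r, c)
        for r, row in enumerate(matrix)
        for c, value in enumerate(row)
        if value is None or (isinstance(value, float) and math.isnan(value))
    ]


expression = [[1.2, 3.4, float("nan")], [0.5, None, 2.2]]
print(find_missing(expression))
```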

Parameters

HC default settings.png


Clustering Method

This parameter is used to indicate the convention used for determining cluster-to-cluster distances when constructing the hierarchical tree. Available options are:

  • Single Linkage - The distances are measured between each member of one cluster and each member of the other cluster. The minimum of these distances is considered the cluster-to-cluster distance. This method often leads to a "chaining" effect and is usually not recommended.
  • Average Linkage - The average distance of each member of one cluster to each member of the other cluster is used as a measure of cluster-to-cluster distance.
  • Total Linkage - The distances are measured between each member of one cluster and each member of the other cluster. The maximum of these distances is considered the cluster-to-cluster distance.
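
The three conventions differ only in how the set of pairwise distances is summarized. A minimal sketch, assuming Euclidean distance between members and using the hypothetical function name `cluster_distance`:

```python
from math import dist


def cluster_distance(cluster_a, cluster_b, method="average"):
    """Cluster-to-cluster distance under single, average, or total linkage."""
    # Distance from every member of one cluster to every member of the other.
    pairwise = [dist(p, q) for p in cluster_a for q in cluster_b]
    if method == "single":
        return min(pairwise)
    if method == "total":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown method: {method}")


a = [(0.0, 0.0), (0.0, 2.0)]
b = [(3.0, 0.0), (3.0, 4.0)]
for method in ("single", "average", "total"):
    print(method, round(cluster_distance(a, b, method), 3))
```

For the same two clusters, single linkage always gives the smallest value and total linkage the largest, with average linkage in between.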

Clustering Dimension

This parameter indicates whether to cluster markers, microarrays, or both.

  • Marker - Cluster the selected markers (genes) based on the similarity across microarrays.
  • Microarray - Cluster the selected microarrays based on the similarity across markers.
  • Both - Cluster both markers and microarrays.
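
In terms of the expression matrix, the dimension simply decides whether rows or columns are treated as the items to cluster. The sketch below assumes rows are markers and columns are microarrays; the function name `vectors_to_cluster` and the lowercase option strings are hypothetical. Choosing "Both" would correspond to building both sets of vectors and clustering each separately.

```python
def vectors_to_cluster(matrix, dimension):
    """Return the vectors to be clustered for the chosen dimension."""
    if dimension == "marker":
        # One vector per marker: its values across all microarrays (a row).
        return [list(row) for row in matrix]
    if dimension == "microarray":
        # One vector per microarray: its values across all markers (a column).
        return [list(column) for column in zip(*matrix)]
    raise ValueError(f"unknown dimension: {dimension}")


matrix = [[1, 2, 3], [4, 5, 6]]  # 2 markers x 3 microarrays
print(vectors_to_cluster(matrix, "marker"))
print(vectors_to_cluster(matrix, "microarray"))
```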

Clustering Metric

The items being clustered, whether markers or microarrays, can each be represented as a vector of numbers: a row (marker) or a column (microarray) taken from a spreadsheet view of all expression values. Several methods are offered for calculating the distance between any two vectors:

  • Euclidean - The direct, point-to-point distance is calculated (square root of the sum of squared differences).
  • Pearson's - Pearson's correlation coefficient for two vectors is calculated.
  • Spearman's - Spearman's rank correlation coefficient for two vectors is calculated.
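
The three measures can be written out as follows. This is a plain-Python sketch under stated assumptions, not geWorkbench's implementation; all function names are hypothetical. Note that Pearson's and Spearman's coefficients measure similarity rather than distance. How geWorkbench converts them to a distance is not stated here; a common convention elsewhere is to use 1 minus the coefficient.

```python
from math import sqrt


def euclidean(x, y):
    """Square root of the sum of squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson's correlation coefficient (undefined for constant vectors)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson's coefficient computed on the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 4.0, 9.0, 16.0]
print(round(euclidean(x, y), 3), round(pearson(x, y), 3), round(spearman(x, y), 3))
```

In this example the two vectors rise together but not linearly, so Spearman's coefficient reaches 1 while Pearson's stays slightly below it.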

Set Selection

Activating one or more sets of markers or arrays in the Marker_Sets or Array_Sets components limits an analysis to those items in the active set(s).

Analysis Actions

This component uses the standard analysis component framework, which provides three buttons:

  • Analyze - Start the clustering job.
  • Save Settings - Save the current settings to a named entry in the settings list.
  • Delete Settings - Delete the selected setting entry from the list.

Services (Grid)

Hierarchical Clustering can be run either locally within geWorkbench, or remotely as a grid job on caGrid. See the Grid Services section for further details on setting up a grid job.

Running Hierarchical Clustering

This example clusters a set of markers generated through a t-test. You can generate or select any set of markers to run your own example.

1. Activate a set of markers. Here, "Significant Genes [437]" (which contains 437 markers) was activated.


HC Marker Set Activation.png


2. Set the parameters as desired. Here, we used:


HC setup.png

  • Clustering Method: Average Linkage.
  • Clustering Dimension: Both.
  • Clustering Metric: Pearson's Correlation.

3. Click Analyze.

4. A progress bar will be visible during the calculation, first displaying a message regarding computing distances....


T HC computing.png


... and then about clustering.


T HC Clustering message.png


The results are placed in the Workspace and labeled "Hierarchical Clustering", and are displayed in the Dendrogram component.

The Dendrogram Visual Component

Hierarchical clustering results are displayed in the Dendrogram component.

HierarchicalClustering Dendrogram.png

Controls

Enable Selection

Checking this box allows a subtree of the dendrogram to be selected interactively using the mouse. The selected area is highlighted in blue. Clicking on the selected area will restrict the display to just the selected portion of the tree.

Gene Height

Sets the height in pixels of the rows devoted to each marker and associated labels.

Gene Width

Sets the width in pixels of the columns devoted to each array. Label text is also scaled proportionately.

Color key

Shows the range of color values from lowest to highest expression for the current display preference. See the Color Mosaic tutorial for further details on the absolute and relative display preference settings.

Intensity slider

The intensity slider adjusts the midpoint of the color scale to lower or higher expression values.

Bulb Icon

Pushing the bulb icon activates a tool-tip feature on the dendrogram display. Mousing over the dendrogram will bring up a display of the following information for any point:

  • Chip: the array name
  • Marker: the marker (probeset) name.
  • Signal: the expression value of the selected marker on the selected array.

HC Dendrogram Tooltip.png

Right-click menu items

Right-clicking on the dendrogram will show a menu with two entries:

  • Image Snapshot - place a static snapshot of the tree as currently displayed into the Workspace.
  • Add to Set - add the markers represented in the currently displayed tree to a new set in the Markers component called "Cluster Tree". This is most useful if done after a subtree of markers has been selected.

HC Dendrogram add to set.png

Working with Hierarchical Clustering Results

Selecting a subtree

The Dendrogram component allows one to select and work with just a portion of the displayed tree. To activate this feature, check the Enable Selection checkbox at lower left in the Dendrogram component (indicated by the red arrow). The subtree selection will work for both markers and arrays, depending only on whether they were included in the initial clustering calculation.

The following figure illustrates selecting a subtree of markers. Moving the cursor over the displayed tree draws a blue rectangle over the selected portion.


HierarchicalClustering Dendrogram select markers.png



Clicking on the selected area will cause only this area to be displayed, as shown below. A subset of arrays can also be selected, again shown in blue highlighting.


HC Dendrogram select arrays.png


The result of the second selection is this small set of data points:


HC markers arrays selected.png


Working with a subtree in the Dendrogram

As already mentioned, the right-click menu allows one to save the markers in a displayed subtree to the Markers component, or the arrays in a displayed array subtree to the Arrays/Phenotypes component:

  • Add to Set - add the markers and arrays represented in the currently displayed tree to new sets in the Markers and Arrays components called "Cluster Tree".


HC Dendrogram add to set.png


The following figure shows the new set of markers (labeled "Cluster Tree [105]") after it has been added to the Markers component. The number in brackets indicates how many markers are contained in the set.


HC Markers ClusterTree.png


Arrays are added to the Arrays component (labeled "Cluster Tree [10]"), as shown here:


HC Array ClusterTree.png
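
Conceptually, the Add to Set action described above collects every leaf under the currently displayed subtree. The sketch below illustrates that idea on a toy tree held as nested pairs; the tree representation, the marker names, and the function name `leaves` are all hypothetical and are not taken from geWorkbench.

```python
def leaves(node):
    """Return the labels of all leaves under a node of a binary tree of nested pairs."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return leaves(left) + leaves(right)


# A toy dendrogram: each internal node is a (left, right) pair, each leaf a marker name.
tree = (("geneA", "geneB"), ("geneC", ("geneD", "geneE")))
subtree = tree[1]  # the portion of the tree that would be selected and displayed
cluster_tree = leaves(subtree)
print(f"Cluster Tree [{len(cluster_tree)}]", cluster_tree)
```

The bracketed count in the printed label mirrors the convention used for set names, where the number gives the size of the set.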