Difference between revisions of "Tutorials"

 
(114 intermediate revisions by 3 users not shown)
__TOC__
+
{{TutorialsTopNav}}
===Getting Started===
 
  
With geWorkbench you can work with both microarray gene expression data and gene or protein sequences.  Many kinds of analysis are supported.  For microarrays, there are filtering and normalization, basic statistical analyses, clustering, network reverse engineering, and many common visualization tools.  For sequence data there are routines such as BLAST, pattern detection, transcription factor mapping, and syntenic region analysis.  Furthermore, genomic sequences around markers of interest found in microarray experiments can be easily retrieved and, for example, used for promoter/TF analysis.
+
__NOTOC__
  
geWorkbench is designed from the ground up to be extensible.  New modules can be programmed to interact directly with its framework, or  existing code can be wrapped in a geWorkbench adaptor to allow seamless communications with the framework and other modules.
 
  
To start using geWorkbench, one must supply initial datafiles.  For microarray data, several formats are currently available, including MAS5/GCOS text files, GenePix files, and a simple, geWorkbench-specific matrix format.  In the next section, we will show how to read in MAS5 format files and write out a matrix file.  For sequence data, fasta format files are accepted.
+
The tutorials shown on this page provide a quick introduction to the most important features of geWorkbench.  
  
For the time being, the latest version of geWorkbench for Windows can be downloaded from the link below.  Full download instructions and versions for Linux and Macintosh will be available in the Download section soon.
+
For the '''web version''' of geWorkbench, please see [[GeWorkbench-web/Tutorials | GeWorkbench-web Tutorials]]
  
[http://amdec-bioinfo.cu-genome.org/html/caWorkBench/newCaW3Downloads/zipfiles/InstData/Windows/ geWorkbench3.0]
 
  
  
===Loading Data===
+
==Using the basic framework of geWorkbench==
 +
The graphical interface, files and data
  
When first started, geWorkbench appears so:
+
===[[QuickStart | Quick Start]]===
 +
A quick jump into the most important topics for learning to use geWorkbench.
  
[[Image:T_StartupState.png]]
+
===[[Basics | Basics]]===
 +
An introduction to the use of geWorkbench.
  
 +
===[[Menu_Bar]]===
 +
Many geWorkbench commands are available in the upper menu bar, as well as in the [[Workspace]].
  
Right-click on the '''Workspace''' entry in the '''Project Folders''' window at upper left to create a new project.
+
===[[Component_Configuration_Manager | Component Configuration Manager]]===
 +
Customize geWorkbench to your needs.  geWorkbench comes initially configured with only basic components installed.  Use the CCM to load additional available modules.
  
[[Image:T_NewProject.png]]
+
===[[Workspace]]===
 +
The [[Workspace]] is where data is loaded and analysis results are stored.
  
 +
===[[Information Panel]]===
 +
Describes components used to record details of calculations and datasets.
  
 +
===[[Local Data Files | Local Data Files]]===
 +
Covers loading data from files on your local computer.
  
Next, right-click on the new project entry and select '''Open Files'''.
+
===[[File_Formats | File Formats]]===
 +
Details of several different file formats supported by geWorkbench.
  
[[Image:T_OpenFiles.png]]
+
===[[CaArray|caArray]]===
 +
How to download microarray data from caArray.  geWorkbench can download "derived" data sets from caArray.
  
 +
===[[Array Sets]] ===
 +
How to create and use sets of arrays for controlling data analysis.
  
 +
===[[Marker Sets]] ===
 +
How to create and use sets of markers for controlling data analysis.
  
Here we will select 10 MAS5 format text files from the directory geworkbench\data\training\cardiogenomics.med.harvard.edu, which is included in the geWorkbench download:
+
===[[Viewing a Microarray Dataset | Viewing a Microarray Dataset]]===
 +
Survey of geWorkbench visualization tools for microarray data.
 +
Includes:
  
[[Image:T_SelectMAS5.png]]
+
* [[Viewing_a_Microarray_Dataset#Microarray_Viewer | Microarray Viewer]]
 +
* [[Viewing_a_Microarray_Dataset#Tabular_Microarray_Viewer | Tabular Microarray Viewer]]
 +
* [[Viewing_a_Microarray_Dataset#CEL_file_image_viewer | CEL file image viewer]]
 +
* [[Viewing_a_Microarray_Dataset#Color_Mosaic | Color Mosaic]]
 +
* [[Viewing_a_Microarray_Dataset#Expression_Profiles | Expression Profiles]]
 +
* [[Viewing_a_Microarray_Dataset#Scatter_Plot |  Scatter Plot]]
  
 +
===[[Filtering|Filtering]]===
 +
geWorkbench provides numerous methods for filtering microarray data.
  
 +
===[[Normalization|Normalization]]===
 +
geWorkbench provides numerous methods for normalizing microarray data.
  
The chip type HG_U95Av2 is recognized...
+
===[[Tutorial_Data | Tutorial Data]]===
 +
Downloadable data used in the tutorials.
  
[[Image:T_Chip_type_message.png]]
+
==Individual analysis and visualization components==
  
 +
===[[Analysis Framework | Analysis Framework]]===
 +
Most analysis routines are located in the command area located in the lower right quadrant of geWorkbench.  This section describes a common framework for saving parameter settings that these components share.
  
 +
===[[ANOVA | ANOVA]]===
 +
How to set up and run Analysis  of Variance.
  
The read-in data is displayed in the '''Microarray Panel'''.  Note that we have increased the intensity slider to maximum here.
+
===[[ARACNe | ARACNe]]===
 +
A formal method for reverse engineering: microarray datasets can be analyzed for interactions between genes.  Now includes the new ARACNe2, which implements the much faster Adaptive Partitioning algorithm and accurate parameter estimation.
  
[[Image:T_MAS5_display.png]]
+
===[[BLAST| BLAST]]===
 +
Submits BLAST jobs to the NCBI server and displays and allows further interaction with alignment results.
  
 +
===[[Cellular Networks KnowledgeBase | Cellular Networks KnowledgeBase (CNKB)]]===
 +
The CNKB component queries a database of protein-protein and protein-DNA interactions maintained at Columbia University.
  
 +
===[[CeRNA_Query]]===
 +
This component provides query access to a precomputed database of competitive endogenous RNA (ceRNA) interactions, also called "sponge" interactions. These interactions underlie a post-transcriptional layer of regulation, and were predicted using the Hermes algorithm (Sumazin et al., 2011).
  
We can now assign phenotypes to each chip.  We will place the phenotypes in the default group; however, you can create new phenotype groups by pushing the '''New''' button on the '''Phenotype Panel''' at lower left.
+
===[[Classification | Classification]]===
 +
Several classification components have been ported by the GenePattern development team to work with geWorkbench.  These include K-nearest neighbors (KNN), Principal Component Analysis (PCA), Support Vector Machines (SVM) and Weighted Voting (WV).
  
Here we select and label arrays in the '''Phenotype Panel''' which contain samples from the congestive cardiomyopathy disease state...
+
===[[Color Mosaic | Color Mosaic]]===
 +
Displays expression results as a heat map.
  
[[Image:T_PanelLabelCardio.png]]
+
===[[Consensus_Clustering | Consensus Clustering]]===
 +
This component allows geWorkbench to run Consensus Clustering on a GenePattern server.
  
 +
===[[Cytoscape_Network_Viewer | Cytoscape]]===
 +
Cytoscape is used to display network interaction diagrams (from adjacency matrices).  It features two-way interaction with the geWorkbench Markers component.
  
 +
===[[Cupid]]===
 +
Cupid (Sumazin et al. 2011) generates information that can help predict if a gene is a target of a specific miRNA. The Cupid service provides a simple query interface to a database of precalculated Cupid results.
  
Next, we can similarly label the remaining arrays as "Normal". We have also checked boxes to indicate that these groups of arrays are "Active".  Various analysis and visualization components can be set to only use/display activated arrays or markers.
+
===[[DeMAND]]===
 +
The DeMAND (Drug Mode of Action through Network Dysregulation) algorithm measures dysregulation between the expression of two genes in a network caused by, e.g., a drug perturbation. The list of top dysregulated gene pairs can reveal details of a drug's mode of action in the tested cellular system or tissue.
  
[[Image:T_PhenotypesPriorToCase.png]]
 
  
 +
===[[Expression Value Distribution | Expression Value Distribution]]===
 +
View and manipulate a histogram of the distribution of expression values for each array.
  
 +
===[[Fold Change]]===
 +
Compare the ratio of the expression of genes between two sets of arrays, e.g. case and control sets.
  
For statistical tests such as the t-test the Case and Control groups can be specified.  This is done by left-clicking on the thumb-tack icon in front of the phenotype name.  Here we are specifying the disease arrays as the "Case".  The remaining "Normal" arrays are by default labeled control.
+
===[[Gene_Ontology_Term_Analysis | Gene Ontology Term Analysis]]===
[[Image:T_PhenotypeSettingCase.png]]
+
Finds Gene Ontology terms that are over-represented in a list of genes of interest.
  
 +
===[[Gene_Ontology_Viewer | Gene Ontology Viewer]]===
 +
The Gene Ontology Viewer provides both a standalone GO Term browser, as well as displaying results of GO Term Analysis.  Genes associated with a term can be copied back into a marker set for further analysis.
  
 +
===[[GenomeSpace]]===
 +
GenomeSpace allows for the transfer of data between a number of different genomics and bioinformatics software analysis platforms, including geWorkbench.
  
A red thumbtack indicates the arrays have been specified as "Case".
+
===[[genSpace | genSpace]]===
 +
GenSpace is a social networking tool which allows patterns of use (putative workflows) of geWorkbench components to be inferred and queried.  If desired (participation is entirely optional), it can be used to identify potential expert users of particular components who may be able to provide advice.
  
[[Image:T_PhenotypeCaseSet.png]]
+
===[[Grid Services | Grid Services]]===
 +
A number of geWorkbench data analysis components have been implemented as services on the National Cancer Institute's caGrid.  caGrid is an infrastructure component of the NCI's [http://cabig.nci.nih.gov/ caBIG(R)] program.
  
 +
===[[GSEA]]===
 +
Implements a front-end for submitting data to and viewing the results of a GSEA (Subramanian et al, 2005) analysis on a GenePattern server.
  
 +
===[[Hierarchical Clustering | Hierarchical Clustering]]===
 +
geWorkbench implements its own agglomerative hierarchical clustering algorithm.
  
We can also rename the merged dataset by clicking on its entry in the '''Project Panel'''.
+
===[[IDEA]]===
 +
The IDEA (interactome dysregulation enrichment analysis) algorithm uses a genome-wide molecular interaction map as a systematic framework for the identification of genes playing a role in oncogenesis.
  
[[Image:T_RenameDataset.png]]
+
===[[Jmol | Jmol]]===
 +
Jmol is a molecular structure viewer for viewing PDB format files.
  
 +
===[[K-Means_Clustering]]===
 +
Provides an interface to running K-Means Clustering on a GenePattern server, and a viewer for the results.
  
 +
===[[LINCS_Query]]===
 +
This component provides for query and display of data generated by the Columbia LINCS Technology U01 and Computation U01 Centers. It provides experimental and computational results for drug mode of action and similarity calculations, and for synergy experiments.
  
Here we will call it CCMP.
+
===[[Marker Annotations | Marker Annotations]]===
 +
Marker annotations can be retrieved, including BioCarta pathway diagrams.
  
[[Image:T_RenamingDataset.png]]
+
===[[MarkUs | MarkUs]]===
 +
The MarkUs component assists in the assessment  of the biochemical function for a given protein structure.  The component in geWorkbench provides an interface to the MarkUs web server at Columbia.  MarkUs identifies related protein structures and sequences, detects  protein cavities, and calculates the surface electrostatic potentials  and amino acid conservation profile.
  
 +
===[[Master Regulator Analysis | Master Regulator Analysis]]===
 +
Master Regulator analysis [Lefebvre et al., 2010] is an algorithm used to identify transcription factors whose targets (e.g., as represented in an ARACNe-generated interactome) are enriched for a particular gene signature (e.g. a list of differentially expressed genes).
  
 +
====[[MRA-FET]]====
 +
Master Regulator Analysis using Fisher's Exact Test.
  
With the datasets merged, classified and named, we can save the dataset for future use. We will call it "cardiomyopathy.exp" (.exp is the default extension for the geWorkbench matrix file type).
+
====[[MARINa]]====
 +
Master Regulator Analysis using the MARINa algorithm. GSEA is used to compute enrichment.
  
[[Image:T_SaveProject.png]]
+
===[[MatrixREDUCE | MatrixREDUCE]]===
 +
MatrixREDUCE is a tool for inferring the binding  specificity and nuclear concentration of transcription  factors from microarray data.
  
 +
===[[MINDy| MINDy]]===
 +
MINDy identifies modulators of gene regulation using conditional ARACNe calculations.
  
 +
===[[Pattern Discovery | Pattern Discovery]]===
 +
Upstream sequences can be analyzed for conserved sequence patterns.
  
The default display of microarray data is an absolute display.  We can change it to a relative display by selecting Tools:Preferences from the top menu bar.  We have removed the dataset so that we can read it back in using the new preferences.
+
===[[PCA| Principal Component Analysis (PCA)]]===
 +
Find components of the data responsible for the greatest variance.  Provides a front-end to analysis on a GenePattern server, and graphical visualization of the results.
  
[[Image:T_ChangePrefs.png]]
+
===[[Promoter Analysis | Promoter Analysis]]===
 +
Search a set of sequences against a promoter database.
  
 +
===[[Pudge | Pudge]]===
 +
Pudge provides an interface to a protein structure prediction server  (Honig lab) which integrates tools used at different stages  of the structural prediction process.
  
 +
===[[SAM|SAM]]===
 +
Interface to run the R implementation of Significance Analysis of Microarrays.
  
Here we select the '''relative''' display type.
+
===[[Sequence_Retriever | Sequence Retriever]]===
 +
Genomic and protein sequences for selected genes can be retrieved for further analysis.
  
[[Image:T_ChangePrefsToRelative.png]]
+
===[[SkyBase]]===
 +
Search the SkyBase database with a sequence of interest to find homology models which meet user-defined alignment coverage and sequence identity constraints.
 +
SkyBase is a database that stores the homology models built by SkyLine analysis for
 +
* structures in the RCSB Protein Data Bank (PDB) with a 60% redundancy cutoff (PDB60)
 +
* structures in the Northeast Structural Genomics Consortium database
  
 +
===[[SkyLine]]===
 +
SkyLine is a pipeline for large-scale protein homology modeling.  [[SkyBase]] provides access to precomputed models generated using SkyLine.
  
 +
===[[SOM | SOM]]===
 +
Clustering using Self-Organizing Maps.
  
Returning to the Open File dialog as we did before by right-clicking on the project entry, we will select the "cardiomyopathy.exp" file we previously saved...
+
===[[SVM | SVM]]===
 +
Classification using Support Vector Machines.
  
[[Image:T_OpenCardio.png]]
+
===[[T-test | T-Test]]===
 +
Several variants of Student's t-test for differential expression are available.
  
 +
===[[Viper_Analysis]]===
 +
The VIPER (Virtual Inference of Protein-activity by Enriched Regulon analysis) [Alvarez et al., manuscript in preparation] component in geWorkbench transforms the expression profile for each sample (column) into a transcription-factor activity profile, representing the relative activity of each TF in each sample.
  
 
+
===[[Volcano_Plot]]===
Resulting in the following colorful display of the array data for the first array....
+
The Volcano Plot graphically depicts the results of the t-test for differential expression.  The log2 fold change for each significant marker is plotted against the -log10 of the P-value.
 
 
[[Image:T_RelativeDisplay.png]]
 

Latest revision as of 11:12, 29 April 2015





The tutorials shown on this page provide a quick introduction to the most important features of geWorkbench.

For the web version of geWorkbench, please see GeWorkbench-web Tutorials


Using the basic framework of geWorkbench

The graphical interface, files and data

Quick Start

A quick jump into the most important topics for learning to use geWorkbench.

Basics

An introduction to the use of geWorkbench.

Menu_Bar

Many geWorkbench commands are available in the upper menu bar, as well as in the Workspace.

Component Configuration Manager

Customize geWorkbench to your needs. geWorkbench comes initially configured with only basic components installed. Use the CCM to load additional available modules.

Workspace

The Workspace is where data is loaded and analysis results are stored.

Information Panel

Describes components used to record details of calculations and datasets.

Local Data Files

Covers loading data from files on your local computer.

File Formats

Details of several different file formats supported by geWorkbench.

caArray

How to download microarray data from caArray. geWorkbench can download "derived" data sets from caArray.

Array Sets

How to create and use sets of arrays for controlling data analysis.

Marker Sets

How to create and use sets of markers for controlling data analysis.

Viewing a Microarray Dataset

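Survey of geWorkbench visualization tools for microarray data. Includes the Microarray Viewer, Tabular Microarray Viewer, CEL file image viewer, Color Mosaic, Expression Profiles, and Scatter Plot.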

Filtering

geWorkbench provides numerous methods for filtering microarray data.

Normalization

geWorkbench provides numerous methods for normalizing microarray data.

Tutorial Data

Downloadable data used in the tutorials.

Individual analysis and visualization components

Analysis Framework

Most analysis routines are located in the command area located in the lower right quadrant of geWorkbench. This section describes a common framework for saving parameter settings that these components share.

ANOVA

How to set up and run Analysis of Variance.

ARACNe

A formal method for reverse engineering: microarray datasets can be analyzed for interactions between genes. Now includes the new ARACNe2, which implements the much faster Adaptive Partitioning algorithm and accurate parameter estimation.

BLAST

Submits BLAST jobs to the NCBI server and displays and allows further interaction with alignment results.

Cellular Networks KnowledgeBase (CNKB)

The CNKB component queries a database of protein-protein and protein-DNA interactions maintained at Columbia University.

CeRNA_Query

This component provides query access to a precomputed database of competitive endogenous RNA (ceRNA) interactions, also called "sponge" interactions. These interactions underlie a post-transcriptional layer of regulation, and were predicted using the Hermes algorithm (Sumazin et al., 2011).

Classification

Several classification components have been ported by the GenePattern development team to work with geWorkbench. These include K-nearest neighbors (KNN), Principal Component Analysis (PCA), Support Vector Machines (SVM) and Weighted Voting (WV).

Color Mosaic

Displays expression results as a heat map.

Consensus Clustering

This component allows geWorkbench to run Consensus Clustering on a GenePattern server.

Cytoscape

Cytoscape is used to display network interaction diagrams (from adjacency matrices). It features two-way interaction with the geWorkbench Markers component.

Cupid

Cupid (Sumazin et al. 2011) generates information that can help predict if a gene is a target of a specific miRNA. The Cupid service provides a simple query interface to a database of precalculated Cupid results.

DeMAND

The DeMAND (Drug Mode of Action through Network Dysregulation) algorithm measures dysregulation between the expression of two genes in a network caused by, e.g., a drug perturbation. The list of top dysregulated gene pairs can reveal details of a drug's mode of action in the tested cellular system or tissue.


Expression Value Distribution

View and manipulate a histogram of the distribution of expression values for each array.

Fold Change

Compare the ratio of the expression of genes between two sets of arrays, e.g. case and control sets.
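
As a rough illustration of the idea only (not the geWorkbench implementation), the fold change for a single marker can be sketched as the ratio of its mean expression in the case arrays to its mean expression in the control arrays. The function name and the expression values below are hypothetical.

```python
import math
from statistics import mean


def fold_change(case_values, control_values):
    """Return (ratio, log2 ratio) of mean case to mean control expression."""
    case_mean = mean(case_values)
    control_mean = mean(control_values)
    # A ratio and its log2 are only meaningful for positive means.
    if case_mean <= 0 or control_mean <= 0:
        raise ValueError("mean expression must be positive in both sets")
    ratio = case_mean / control_mean
    return ratio, math.log2(ratio)


# Hypothetical expression values for one marker in two array sets.
case = [400.0, 420.0, 380.0]
control = [100.0, 110.0, 90.0]
ratio, log2_ratio = fold_change(case, control)
print(ratio, log2_ratio)
```

Reporting the log2 of the ratio alongside the ratio itself makes up- and down-regulation symmetric around zero.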

Gene Ontology Term Analysis

Finds Gene Ontology terms that are over-represented in a list of genes of interest.

Gene Ontology Viewer

The Gene Ontology Viewer provides both a standalone GO Term browser, as well as displaying results of GO Term Analysis. Genes associated with a term can be copied back into a marker set for further analysis.

GenomeSpace

GenomeSpace allows for the transfer of data between a number of different genomics and bioinformatics software analysis platforms, including geWorkbench.

genSpace

GenSpace is a social networking tool which allows patterns of use (putative workflows) of geWorkbench components to be inferred and queried. If desired (participation is entirely optional), it can be used to identify potential expert users of particular components who may be able to provide advice.

Grid Services

A number of geWorkbench data analysis components have been implemented as services on the National Cancer Institute's caGrid. caGrid is an infrastructure component of the NCI's caBIG(R) program.

GSEA

Implements a front-end for submitting data to and viewing the results of a GSEA (Subramanian et al, 2005) analysis on a GenePattern server.

Hierarchical Clustering

geWorkbench implements its own agglomerative hierarchical clustering algorithm.

IDEA

The IDEA (interactome dysregulation enrichment analysis) algorithm uses a genome-wide molecular interaction map as a systematic framework for the identification of genes playing a role in oncogenesis.

Jmol

Jmol is a molecular structure viewer for viewing PDB format files.

K-Means_Clustering

Provides an interface to running K-Means Clustering on a GenePattern server, and a viewer for the results.

LINCS_Query

This component provides for query and display of data generated by the Columbia LINCS Technology U01 and Computation U01 Centers. It provides experimental and computational results for drug mode of action and similarity calculations, and for synergy experiments.

Marker Annotations

Marker annotations can be retrieved, including BioCarta pathway diagrams.

MarkUs

The MarkUs component assists in the assessment of the biochemical function for a given protein structure. The component in geWorkbench provides an interface to the MarkUs web server at Columbia. MarkUs identifies related protein structures and sequences, detects protein cavities, and calculates the surface electrostatic potentials and amino acid conservation profile.

Master Regulator Analysis

Master Regulator analysis [Lefebvre et al., 2010] is an algorithm used to identify transcription factors whose targets (e.g., as represented in an ARACNe-generated interactome) are enriched for a particular gene signature (e.g. a list of differentially expressed genes).

MRA-FET

Master Regulator Analysis using Fisher's Exact Test.
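
To make the enrichment idea concrete, here is a minimal sketch of a one-sided Fisher's exact test, computed as the upper tail of the hypergeometric distribution. It asks how likely it is to see at least the observed overlap between a transcription factor's targets and a gene signature by chance. This is an assumption about the general form of the test; the MRA-FET component may differ in its details, and the function name and counts are hypothetical.

```python
from math import comb


def fisher_enrichment_p(n_universe, n_signature, n_targets, n_overlap):
    """One-sided P-value: chance of >= n_overlap signature genes among the targets.

    n_universe  -- total number of genes considered
    n_signature -- genes in the signature (e.g. differentially expressed)
    n_targets   -- targets of the transcription factor
    n_overlap   -- targets that are also in the signature
    """
    max_overlap = min(n_signature, n_targets)
    total = comb(n_universe, n_targets)
    # Sum hypergeometric probabilities for the observed overlap and larger ones.
    tail = sum(
        comb(n_signature, k) * comb(n_universe - n_signature, n_targets - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts: 10 genes, 5 in the signature, 5 targets, all 5 overlapping.
p_value = fisher_enrichment_p(10, 5, 5, 5)
print(p_value)
```

A small P-value suggests the targets are enriched for the signature; in practice a correction for testing many transcription factors would also be needed.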

MARINa

Master Regulator Analysis using the MARINa algorithm. GSEA is used to compute enrichment.

MatrixREDUCE

MatrixREDUCE is a tool for inferring the binding specificity and nuclear concentration of transcription factors from microarray data.

MINDy

MINDy identifies modulators of gene regulation using conditional ARACNe calculations.

Pattern Discovery

Upstream sequences can be analyzed for conserved sequence patterns.

Principal Component Analysis (PCA)

Find components of the data responsible for the greatest variance. Provides a front-end to analysis on a GenePattern server, and graphical visualization of the results.

Promoter Analysis

Search a set of sequences against a promoter database.

Pudge

Pudge provides an interface to a protein structure prediction server (Honig lab) which integrates tools used at different stages of the structural prediction process.

SAM

Interface to run the R implementation of Significance Analysis of Microarrays.

Sequence Retriever

Genomic and protein sequences for selected genes can be retrieved for further analysis.

SkyBase

Search the SkyBase database with a sequence of interest to find homology models which meet user-defined alignment coverage and sequence identity constraints. SkyBase is a database that stores the homology models built by SkyLine analysis for

  • structures in the RCSB Protein Data Bank (PDB) with a 60% redundancy cutoff (PDB60)
  • structures in the Northeast Structural Genomics Consortium database

SkyLine

SkyLine is a pipeline for large-scale protein homology modeling. SkyBase provides access to precomputed models generated using SkyLine.

SOM

Clustering using Self-Organizing Maps.

SVM

Classification using Support Vector Machines.

T-Test

Several variants of Student's t-test for differential expression are available.

Viper_Analysis

The VIPER (Virtual Inference of Protein-activity by Enriched Regulon analysis) [Alvarez et al., manuscript in preparation] component in geWorkbench transforms the expression profile for each sample (column) into a transcription-factor activity profile, representing the relative activity of each TF in each sample.

Volcano_Plot

The Volcano Plot graphically depicts the results of the t-test for differential expression. The log2 fold change for each significant marker is plotted against the -log10 of the P-value.
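
The plot coordinates described above can be sketched as follows. This is an illustration under stated assumptions, not the component's code: the marker names, fold-change ratios, P-values, and the 0.05 significance cutoff are all hypothetical.

```python
import math


def volcano_point(fold_ratio, p_value):
    """Map a fold-change ratio and a P-value to (x, y) volcano-plot coordinates."""
    if fold_ratio <= 0:
        raise ValueError("fold-change ratio must be positive")
    if not 0 < p_value <= 1:
        raise ValueError("P-value must be in (0, 1]")
    # x: log2 fold change; y: -log10 of the P-value.
    return math.log2(fold_ratio), -math.log10(p_value)


# Hypothetical t-test results: marker -> (fold-change ratio, P-value).
markers = {
    "marker_a": (4.0, 0.001),
    "marker_b": (0.5, 0.01),
    "marker_c": (1.1, 0.6),
}
alpha = 0.05  # assumed significance cutoff

# Only markers passing the cutoff are placed on the plot.
points = {
    name: volcano_point(ratio, p)
    for name, (ratio, p) in markers.items()
    if p < alpha
}
print(points)
```

Markers with large changes and small P-values end up in the upper left and upper right corners, which gives the plot its characteristic shape.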