Difference between revisions of "Basics"

 
(Data representation)
 
(46 intermediate revisions by 3 users not shown)
Line 1: Line 1:
 +
{{TutorialsTopNav}}
  
==Introduction==
 
  
[[Image:slide1.gif]]
+
__TOC__
  
 +
==Introduction to geWorkbench==
 +
geWorkbench (genomics Workbench) is a Java-based open-source platform for integrated genomics. Using a component architecture it allows individually developed plug-ins to be configured into complex bioinformatic applications. At present there are more than 70 available plug-ins supporting the visualization and analysis of gene expression and sequence data. Example use cases include:
  
geWorkbench is an open-source bioinformatics platform that offers a comprehensive and extensible collection of tools for the management, analysis, visualization and annotation of biomedical data. Data types supported include:
+
* loading data from local or remote data sources.
 +
* visualizing gene expression, molecular interaction networks, protein sequence and protein structure data in a variety of ways.
 +
* providing access to client- and server-side computational analysis tools such as t-test analysis, hierarchical clustering, self-organizing maps, regulatory network reconstruction, BLAST searches, pattern/motif discovery, etc.
 +
* validating computational hypotheses through the integration of gene and pathway annotation information from curated sources as well as through Gene Ontology enrichment analysis.
  
*Microarray Gene Expression
 
*DNA and Protein Sequences
 
*Pathways
 
*Patterns
 
*Gene Ontology
 
*Networks
 
  
Most importantly, it provides an environment which supports moving from one data type to another in a seamless fashion, e.g. from gene expression to sequences to patterns.
 
  
===Developing for geWorkbench===
+
==Layout of the Graphical User Interface==
geWorkbench has been designed using a plug-in framework which allows new modules to be developed with relative ease.  A repository will be maintained for community-developed modules.  Developers can take advantage of all the existing capabilities for data management and visualization, and thus concentrate development efforts on the more important, novel aspects of their project.
 
  
===geWorkbench as an interface to external data and computational resources===
 
geWorkbench provides access to a variety of external data sources, including:
 
Microarray gene expression repositories (caArray)
 
Gene annotation pages (via CGAP)
 
DNA sequences
 
Pathway diagrams (BioCarta)
 
  
geWorkbench also provides a gateway to several computational services currently hosted on Columbia servers and clusters, including:
 
*BLAST
 
*Pattern Discovery
 
*Synteny
 
  
 +
The graphical user interface for geWorkbench is divided into three major sections:
  
 +
# (1) Data management - [[Workspace]] (upper left)
 +
# (2) Visualization tools (right)
 +
# (3) Marker and Array/Phenotype set selection and management  (lower left)
  
With geWorkbench you can work with both microarray gene expression data and gene or protein sequences.  Many kinds of analysis are supported: for microarrays, there are filtering and normalization, basic statistical analyses, clustering, and network reverse engineering, as well as many common visualization tools.  For sequence data there are routines such as BLAST, pattern detection, transcription factor mapping, and syntenic region analysis.  Furthermore, genomic sequences around markers of interest found in microarray experiments can be easily retrieved and, for example, used for promoter/TF analysis.
 
  
 +
[[Image:GeWorkbench_full_GUI_color_mosaic.png|{{ImageMaxWidth}}]]
  
  
To start using geWorkbench, one must supply initial data files.  For microarray data, several formats are currently available, including MAS5/GCOS text files, GenePix files, and a simple, geWorkbench-specific matrix format.  In the next section, we will show how to read in MAS5 format files and write out a matrix file.  For sequence data, FASTA format files are accepted.
+
Commands for analysis, filtering and normalization are now available in a menu activated by right-clicking on any loaded data node.  They can also be found in the top level menu-bar under "Commands".  They will act on the currently selected data node.
  
  
 +
[[Image:Basics_Analysis_t-test.png]]
  
 +
==Window Decorations==
  
==GUI and Component Interoperability==
+
Each component within an area is resizable and detachable.
  
 +
'''Undock'''
 +
* The Undock button is an arrow pointing up and to the right.
 +
* Clicking on the Undock icon detaches the component.  It can then be expanded and positioned at will on the screen.
 +
* In the detached component, the Dock icon is a left-downward arrow. Clicking on the Dock icon reattaches the panel within its GUI area.
  
  
 +
'''Sizing Arrows'''
 +
* To resize an area discretely (maximize/minimize), click on the triangular-shaped wedges located within the separators between panels. Maximizing a panel fills the entire window vertically or horizontally, and minimizes the adjacent panel.
  
  
 +
[[Image:Window_decorations.png]]
  
The graphical user interface for geWorkbench is divided into four major sections:
 
  
1. Data management
+
'''Sizing Handle'''
 +
* Any separator between two panes that has a sizing handle can be moved by right- or left-clicking anywhere on the separator and dragging the edge.
  
2. Marker and Phenotype management
 
  
3. Visualization tools (primarily)
+
[[Image:Window_separator_handle.png]]
  
4. Analytical tools
+
==Command Popup==
  
 +
Pressing F12 on the keyboard pops up a shortcut menu of the geWorkbench commands available for the current dataset.
  
The data management area (1) is called the Project Panel.  It can hold one workspace, and a workspace in turn can hold one or more projects.  Projects can be used as desired to group different data sets.  Any data file or analysis result is stored in a project.  A workspace and all the data it contains can be saved and returned to later.
+
[[Image:F12-Popup.png]]
 
 
 
 
The most important design goal of geWorkbench is to allow data produced or altered in one module to be easily transferred to other modules for successive analysis steps.  There are two places that hold shared data: the Project component (1) and the Panels component (2).  While the Project component holds files and various types of analysis result sets, the Panels component groups markers or phenotypes into panels.  These panels can then be selected for further analysis of only that particular subset of data.  For example, several analysis components produce lists of markers, and each such new list is placed into the Markers component as a new marker panel.  An example of using a phenotype panel is to group microarrays by their disease state.  In a series of tutorials below, we will demonstrate how a panel of markers is defined by selecting a cluster in the Hierarchical Clustering component, and how this panel of markers is then passed to the Sequence Retrieval component to begin sequence analysis.
 
 
 
 
 
A key feature of the GUI is that the modules displayed in the Visualization (3) and Analysis (4) areas depend on the type of data currently selected in the Project Panel.  Thus you will see a different set of choices (tabs) when a microarray data set is selected, as compared to when a DNA or protein sequence file is selected.  When a new data file is loaded, or an analysis produces a new data set, not only is it added to the Project Panel, but an appropriate viewer in the Visualization area is automatically selected.
 
  
 +
==Data management area==
 +
The [[Workspace]] is a top-level geWorkbench component that holds data, analysis results, images, sequences, networks and other types of input and output.  A [[Workspace]] and all the data within it can be saved and later reloaded.  Only one [[Workspace]] at a time can be open.  These operations will be described in detail in further sections of the tutorials.
 +
 
The GUI provides a menu bar at top with a standard choice of commands.  Many commands that are available in the menu bar are also available by right-clicking on data objects.
 
  
  
Begin Mary::::
+
==Set selection and management==
[[Image:(T)MarGettingStarted.png]]
 
 
 
 
 
In order for data to appear you need to load your experiment(s).  '''''See Loading Data Tutorial.'''''
 
 
 
Overview of the GUI and Component Interoperability
 
The graphical user interface for geWorkbench is divided into four major sections
 
 
 
1. Project Area - '' Data management''
 
  
2. Selection Area - ''Marker and Phenotype management''
+
A key feature of geWorkbench is the ability to work with defined sets of markers or arrays. This allows subsets of data to be analyzed, and allows for passing of selected subsets of data between different components. For example, the t-test produces a list of markers showing a significant difference in expression between two states, and this list can then be used to retrieve relevant sequences or annotations.
  
3. View Area - ''Visualization tools (primarily)''
+
==Visualization  and Analysis tools==
 +
geWorkbench works such that only the visualization and analysis components relevant to the type of dataset currently selected in the [[Workspace]] area (1) are displayed in the Visualization Area (2) or in the command menus.  Thus choosing a microarray dataset will result in a different set of visualizers being displayed as compared with those seen when a nucleotide sequence file is selected.   When a new data file is loaded, or an analysis produces a new data set, not only is it added to the [[Workspace]] (1), but an appropriate viewer in the Visualization area (2) is automatically selected.
  
4. Analysis Area - ''Analytical tools''
+
==Component Interoperability==
  
 +
The features described above underlie the most important design goal of geWorkbench, which is to allow the different components to interoperate easily.  The user is freed of the need to write programs to convert data from one format to another for different programs.  Both the [[Workspace]] (1) and the Set Selection component (3) can hold shared data.  Typically each run of an analysis places either a new dataset into the [[Workspace]], or a new set of markers or arrays into the Set Selection component.  These are then available to any other appropriate component for reuse.
  
The data management area (1) is called the Project Area. It can hold one workspace, and a workspace in turn can hold one or more projects. Projects can be used as desired to group different data sets. Any data file or analysis result is stored in a project. A workspace and all the data it contains can be saved and returned to later.
 
  
 +
==Data representation==
  
The most important design goal of geWorkbench is to allow data produced or altered in one module to be easily transferred to other modules for successive analysis steps. There are two places that hold shared data: the Project Area (1) and the Selection Area (2). While the Project Area holds files and various types of analysis result sets, the Selection Area groups markers or phenotypes into panels. These panels can then be selected for further analysis of only that particular subset of data. For example, several analysis components produce lists of markers, and each such new list is placed into the markers component as a new marker panel. An example of using a phenotype panel is to group microarrays by their disease state. In a series of tutorials below, we will demonstrate how a panel of markers is defined by selecting a cluster in the Hierarchical Clustering component, and how this panel of markers is then passed to the Sequence Retrieval component to begin sequence analysis.
+
geWorkbench works on a single dataset at one time, as selected in the [[Workspace]]. For working with data from multiple microarrays simultaneously, for example for statistical analysis or clustering, all of the data must be represented as a single two-dimensional array of experiments and markers.   Such a representation can be created externally, for example in Excel, or geWorkbench can read in expression results from single arrays and merge them into a combined dataset. The merging process is described in detail in [[Local_Data_Files#Merging_Microarray_Files|Local Data Files]] and also in [[Menu_Bar#Merge_Datasets|Menu Bar]].  
  
 +
An option at the time of file read-in allows multiple files to be opened and merged simultaneously, either from disk or from remote data sources such as caArray.  Alternatively, single datasets that have already been read in can be merged at a later time.  Either route creates a single data object in the [[Workspace]] representing all of the arrays.  This object can be viewed, for example, using the color mosaic component.  The merged representation can be saved to the geWorkbench *.exp file type.
  
A key feature of the GUI is that the modules displayed in the View Area (3) and Analysis Area (4) depend on the type of data currently selected in the Project Panel. Thus you will see a different set of choices (tabs) when a microarray data set is selected, as compared to when a DNA or protein sequence file is selected. When a new data file is loaded, or an analysis produces a new data set, not only is it added to the Project Panel, but an appropriate viewer in the Visualization area is automatically selected.
+
The results of any data analysis are stored in the [[Workspace]] as a child of the dataset from which they were created.
  
The GUI provides a menu bar at top with a standard choice of commands. Many commands that are available in the menu bar are also available by right-clicking on data objects.
+
==Limitations==
  
[[Image:(T)MGettingStarted.png]]
+
As currently implemented, operations such as filtering and normalization directly alter the loaded dataset.  However, the original data file is not changed.

Latest revision as of 14:42, 22 April 2014




Introduction to geWorkbench

geWorkbench (genomics Workbench) is a Java-based open-source platform for integrated genomics. Using a component architecture it allows individually developed plug-ins to be configured into complex bioinformatic applications. At present there are more than 70 available plug-ins supporting the visualization and analysis of gene expression and sequence data. Example use cases include:

  • loading data from local or remote data sources.
  • visualizing gene expression, molecular interaction networks, protein sequence and protein structure data in a variety of ways.
  • providing access to client- and server-side computational analysis tools such as t-test analysis, hierarchical clustering, self-organizing maps, regulatory network reconstruction, BLAST searches, pattern/motif discovery, etc.
  • validating computational hypotheses through the integration of gene and pathway annotation information from curated sources as well as through Gene Ontology enrichment analysis.


Layout of the Graphical User Interface

The graphical user interface for geWorkbench is divided into three major sections:

  1. (1) Data management - Workspace (upper left)
  2. (2) Visualization tools (right)
  3. (3) Marker and Array/Phenotype set selection and management (lower left)


GeWorkbench full GUI color mosaic.png


Commands for analysis, filtering and normalization are now available in a menu activated by right-clicking on any loaded data node. They can also be found in the top level menu-bar under "Commands". They will act on the currently selected data node.


Basics Analysis t-test.png

Window Decorations

Each component within an area is resizable and detachable.

Undock

  • The Undock button is an arrow pointing up and to the right.
  • Clicking on the Undock icon detaches the component. It can then be expanded and positioned at will on the screen.
  • In the detached component, the Dock icon is a left-downward arrow. Clicking on the Dock icon reattaches the panel within its GUI area.


Sizing Arrows

  • To resize an area discretely (maximize/minimize), click on the triangular-shaped wedges located within the separators between panels. Maximizing a panel fills the entire window vertically or horizontally, and minimizes the adjacent panel.


Window decorations.png


Sizing Handle

  • Any separator between two panes that has a sizing handle can be moved by right- or left-clicking anywhere on the separator and dragging the edge.


Window separator handle.png

Command Popup

Pressing F12 on the keyboard pops up a shortcut menu of the geWorkbench commands available for the current dataset.

F12-Popup.png

Data management area

The Workspace is a top-level geWorkbench component that holds data, analysis results, images, sequences, networks and other types of input and output. A Workspace and all the data within it can be saved and later reloaded. Only one Workspace at a time can be open. These operations will be described in detail in further sections of the tutorials.

The GUI provides a menu bar at top with a standard choice of commands. Many commands that are available in the menu bar are also available by right-clicking on data objects.


Set selection and management

A key feature of geWorkbench is the ability to work with defined sets of markers or arrays. This allows subsets of data to be analyzed, and allows for passing of selected subsets of data between different components. For example, the t-test produces a list of markers showing a significant difference in expression between two states, and this list can then be used to retrieve relevant sequences or annotations.
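
The idea of passing a marker set from one component to another can be sketched outside geWorkbench. The following is a minimal illustration in Python, not geWorkbench code: the marker names, expression values, annotation table and the cutoff on the t statistic are all invented for the example, and a real t-test component would normally select on a p-value rather than on the raw statistic.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two groups of expression values."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


# Hypothetical data: marker -> (values in state 1, values in state 2).
expression = {
    "marker_A": ([9.1, 9.4, 9.0, 9.3], [5.0, 5.2, 4.9, 5.1]),
    "marker_B": ([6.0, 6.2, 5.9, 6.1], [6.1, 5.9, 6.2, 6.0]),
    "marker_C": ([3.0, 3.1, 2.9, 3.2], [7.0, 7.3, 6.8, 7.1]),
}

# Hypothetical annotation table used by a second, independent step.
annotations = {
    "marker_A": "kinase",
    "marker_B": "housekeeping",
    "marker_C": "transcription factor",
}


def significant_markers(data, t_cutoff=4.0):
    """Return the sorted set of markers whose |t| reaches the cutoff."""
    return sorted(
        marker
        for marker, (state1, state2) in data.items()
        if abs(welch_t(state1, state2)) >= t_cutoff
    )


# Step 1: the "analysis" produces a marker set.
marker_set = significant_markers(expression)

# Step 2: another "component" reuses that set to look up annotations.
selected_annotations = {marker: annotations[marker] for marker in marker_set}
print(selected_annotations)
```

The point of the sketch is only that the marker set is an ordinary named list that a later step can consume without any format conversion.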

Visualization and Analysis tools

geWorkbench works such that only the visualization and analysis components relevant to the type of dataset currently selected in the Workspace area (1) are displayed in the Visualization Area (2) or in the command menus. Thus choosing a microarray dataset will result in a different set of visualizers being displayed as compared with those seen when a nucleotide sequence file is selected. When a new data file is loaded, or an analysis produces a new data set, not only is it added to the Workspace (1), but an appropriate viewer in the Visualization area (2) is automatically selected.

Component Interoperability

The features described above underlie the most important design goal of geWorkbench, which is to allow the different components to interoperate easily. The user is freed of the need to write programs to convert data from one format to another for different programs. Both the Workspace (1) and the Set Selection component (3) can hold shared data. Typically each run of an analysis places either a new dataset into the Workspace, or a new set of markers or arrays into the Set Selection component. These are then available to any other appropriate component for reuse.


Data representation

geWorkbench works on a single dataset at one time, as selected in the Workspace. For working with data from multiple microarrays simultaneously, for example for statistical analysis or clustering, all of the data must be represented as a single two-dimensional array of experiments and markers. Such a representation can be created externally, for example in Excel, or geWorkbench can read in expression results from single arrays and merge them into a combined dataset. The merging process is described in detail in Local Data Files and also in Menu Bar.

An option at the time of file read-in allows multiple files to be opened and merged simultaneously, either from disk or from remote data sources such as caArray. Alternatively, single datasets that have already been read in can be merged at a later time. Either route creates a single data object in the Workspace representing all of the arrays. This object can be viewed, for example, using the color mosaic component. The merged representation can be saved to the geWorkbench *.exp file type.
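
For readers preparing a merged matrix externally, the two-dimensional layout of markers and arrays can be sketched as follows. This is an illustrative Python sketch, not the geWorkbench merge routine: the function name `merge_arrays`, the sample data, and the choice to keep only markers present in every array are assumptions made for the example.

```python
def merge_arrays(arrays):
    """Merge per-array results into one marker-by-array matrix.

    `arrays` maps an array name to a dict of marker -> expression value.
    Returns (markers, array_names, matrix), where matrix[i][j] is the
    value of markers[i] on array_names[j].
    Assumption: only markers present in every array are kept.
    """
    if not arrays:
        return [], [], []
    array_names = list(arrays)
    first = arrays[array_names[0]]
    markers = [m for m in first if all(m in arrays[name] for name in array_names)]
    matrix = [[arrays[name][m] for name in array_names] for m in markers]
    return markers, array_names, matrix


# Hypothetical single-array results, as if read from three separate files.
single_arrays = {
    "array_1": {"marker_A": 9.1, "marker_B": 6.0, "marker_C": 3.0},
    "array_2": {"marker_A": 9.4, "marker_B": 6.2, "marker_C": 3.1},
    "array_3": {"marker_A": 5.0, "marker_B": 6.1},  # marker_C missing here
}

markers, array_names, matrix = merge_arrays(single_arrays)

# Print the merged matrix as a tab-separated table.
print("\t".join(["marker"] + array_names))
for marker, row in zip(markers, matrix):
    print("\t".join([marker] + [str(value) for value in row]))
```

Each row of the result is one marker and each column is one array, which is the single two-dimensional representation described above.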

The results of any data analysis are stored in the Workspace as a child of the dataset from which they were created.

Limitations

As currently implemented, operations such as filtering and normalization directly alter the loaded dataset. However, the original data file is not changed.
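
The distinction between the loaded dataset and the original file can be illustrated with a small Python sketch. It is not geWorkbench code and says nothing about geWorkbench internals: the names `load`, `filter_low_values` and `original_file_rows`, and the data, are hypothetical. It shows why taking a copy before an in-place operation preserves the pre-filter state, and why reloading from the unchanged source also recovers it.

```python
import copy

# Stands in for the data file on disk; a tuple, so it cannot be modified.
original_file_rows = (("marker_A", 9.1), ("marker_B", 0.2), ("marker_C", 7.0))


def load(rows):
    """Read the 'file' into an in-memory dataset (marker -> value)."""
    return {marker: value for marker, value in rows}


def filter_low_values(dataset, minimum):
    """Remove markers below `minimum`, modifying `dataset` in place."""
    for marker in [m for m, v in dataset.items() if v < minimum]:
        del dataset[marker]


loaded = load(original_file_rows)
snapshot = copy.deepcopy(loaded)   # keep the pre-filter state
filter_low_values(loaded, 1.0)     # alters the loaded dataset directly

print(sorted(loaded))                     # filtered, in-memory dataset
print(sorted(snapshot))                   # copy taken before filtering
print(sorted(load(original_file_rows)))   # reloading gives the original data
```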