Difference between revisions of "Workspace"


Revision as of 15:42, 13 August 2008



Outline

In this tutorial, you will learn how to:

  • Create a new Project.
  • Load microarray data.
  • Merge data from several loaded microarray experiments.
  • Download data from a remote instance of the caArray database (supported by NCI).
  • Rename a project and/or project node.
  • Remove a project and/or project node.
  • Save project files that you have created.
  • Load, add, and/or modify remote data.


Overview

This section covers three areas: (1) the logical organization of data within geWorkbench into workspaces and projects, (2) working with data files in this logical structure, and (3) special handling of microarray data files, including obtaining data from local and remote sources and merging multiple data files into one. These topics are covered together because all of these operations are handled within the same Project Folders component.

Workspaces and Projects

In the Project Folders component there is a top-level object called a workspace. The workspace can contain one or more separate projects, and each project can contain opened data files and analysis results. An analogy might be that a workspace is like a drawer in a filing cabinet, and projects are individual folders in that drawer. Projects allow data to be grouped, for example by experiment. A project can contain many different types of data, for example microarray data, FASTA sequence files and graphical images. The workspace as a whole, with all its projects and data nodes, can be saved and restored. However, only one workspace can be open at one time.
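The containment relationship described above can be pictured with a small sketch. This is only an illustration of the workspace / project / data node hierarchy; the class and method names (`Workspace`, `Project`, `DataNode`, `new_project`, `remove_node`) are hypothetical and are not geWorkbench's actual classes or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataNode:
    """One opened data file or analysis result (hypothetical model)."""
    name: str
    kind: str  # for example "microarray", "fasta" or "image"


@dataclass
class Project:
    """A folder that groups data nodes, for example by experiment."""
    name: str
    nodes: List[DataNode] = field(default_factory=list)

    def rename(self, new_name: str) -> None:
        self.name = new_name

    def remove_node(self, node_name: str) -> None:
        # Keep every node except the one with the given name.
        self.nodes = [n for n in self.nodes if n.name != node_name]


@dataclass
class Workspace:
    """The single top-level container; it holds one or more projects."""
    projects: List[Project] = field(default_factory=list)

    def new_project(self, name: str = "New Project") -> Project:
        project = Project(name)
        self.projects.append(project)
        return project


# Illustrative usage with made-up names.
workspace = Workspace()
project = workspace.new_project()
project.nodes.append(DataNode("heart_arrays", "microarray"))
project.nodes.append(DataNode("promoters.fasta", "fasta"))
project.rename("Cardiac experiment")
```

A project here holds nodes of mixed types, mirroring the statement that one project can contain microarray data, sequence files and images side by side.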

Microarray data and merging

When working with microarray data, all data to be analyzed must be present within one data node in a project. If the data exists as multiple files containing results from single arrays, the data must be merged into a single node before it can be used. geWorkbench can perform this merging step either at the time data is read in, or later in a separate step. Once merged, such a dataset can be saved to disk; it will be saved in the geWorkbench matrix file format.
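As a rough conceptual sketch of what merging means, the following combines several single-array datasets into one marker-by-array matrix. It is not geWorkbench's implementation or its matrix file format: the function name `merge_arrays`, the dictionary layout, the marker IDs and the values are all assumptions made up for illustration. The sketch also assumes that arrays from the same chip type share exactly the same set of markers.

```python
def merge_arrays(arrays):
    """Merge single-array datasets into one marker-by-array matrix.

    `arrays` maps an array name to a dict of {marker_id: value}.
    All arrays must contain the same markers (same chip type).
    """
    if not arrays:
        raise ValueError("nothing to merge")
    names = list(arrays)
    markers = list(arrays[names[0]])  # marker order of the first array
    for name in names[1:]:
        if set(arrays[name]) != set(markers):
            raise ValueError(
                f"{name} has a different marker set (different chip type?)"
            )
    return {
        "arrays": names,
        "markers": markers,
        # One row per marker, one column per array.
        "values": [[arrays[name][m] for name in names] for m in markers],
    }


# Made-up example data: two arrays, two markers.
single_arrays = {
    "array_01": {"1000_at": 120.5, "1001_at": 33.0},
    "array_02": {"1000_at": 98.2, "1001_at": 41.7},
}
merged = merge_arrays(single_arrays)
```

The merged result is a new object and the inputs are left unchanged, much as a merge performed after loading produces a new data node while the original nodes remain.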


Supported data formats

  • Microarray
    • Affymetrix MAS5/GCOS files - produced by the Affymetrix data analysis programs.
    • Affymetrix File Matrix - a spreadsheet-type multi-experiment format; this is the native file type created by geWorkbench from merged datasets.
    • Tab-delimited text (RMAExpress file) - A simple columnar file format, as produced by the program RMAExpress.
    • Genepix .GPR files - Produced by a popular analysis program for two-color microarrays.
    • Affymetrix CEL files - these files of probe level data can be viewed graphically in geWorkbench but not used directly for analysis.
  • Other
    • FASTA files. DNA or amino-acid sequence files in FASTA format.
    • PDB files - protein 3-dimensional structure files can be viewed in the JMol Viewer in geWorkbench.
    • NetBoost Edge List - used by a component still under development.


Section 1: Working with Projects

Creating a new project

All data must belong to a project. Right-click on the Workspace entry in the Project Folders window at upper left to create a new project.

T NewProject.png


Renaming a project

1. Right-click on the Project folder.

2. Select Rename.


T ProjectFolder RenameProject.png


3. In the pop-up screen rename your project.

4. Click on the OK button.



Renaming a project data node

1. Right-click on a Project Folder data node.

2. Select Rename.

T RenameNode.png


3. In the pop-up screen rename your data node.

T ProjectFolder RenameDataset2.png


4. Click on the OK button.


Removing a project

1. Right-click on the Project folder.

2. Select Remove.


Removing a project data node

1. Right-click on the data node.

2. Select Remove.


Saving a data node to a file

It is here that, among other things, you can create the matrix multi-experiment file format used by geWorkbench from a merged dataset.

1. Right-click on the data node that you want to save.

2. Click Save.

T NodeOptionsMenu.png


A standard file Save screen will come up.

3. Choose a location.

4. Enter a name.

5. Click on the Save button.


Section 2: Loading microarray data files - local

In this example, we will load 10 individual Affymetrix MAS5 format files and merge them into a single dataset. The origin of these files is described in the section Tutorial Data.


Next, right-click on the New Project entry and select Open Files.

T OpenFiles.png


A file browser will appear with which you can select the files you wish to open. The default is to browse for local files, that is, files on your own computer. geWorkbench can also access data from caArray databases.

T OpenFile MAS5.png


Here, we will select file type Affymetrix GCOS/MAS5 as shown.

Make sure to check the Merge files checkbox. This will create the merged data node as the files are read in.

We will select 10 MAS5 format text files from the directory cardiogenomics.med.harvard.edu, which is included in the geWorkbench tutorial download.

Click Open.


A message will appear giving information about associating an annotation file with the dataset.

T Annotation-selection1.png

A file browser will then open with which you can, if you wish, select an annotation file matching your dataset. This is needed if you intend to use features of geWorkbench such as the Sequence Retriever or GO Terms component (Gene Ontology).


T Annotation-selection2.png


A status bar will display as the data is loaded:

T File-loading-status-bar.png


The merged dataset is listed in the Project folder. The data is displayed, in single array format, in the Microarray Viewer. Note we have increased the intensity slider to maximum here. You can scroll through the arrays from first to last using the slider. The display in the Microarray Viewer is by marker in the linear order the markers appear in the data file. It does not correspond in any way to a physical picture or representation of the actual 2-D microarray.

T Merged MAS5 files.png


Section 3: Merging microarray data files after they have already been loaded

If data files are not merged at the time they are read in, they can also be merged later, as long as they are from the same chip type.


1. Select the read-in data files that you want to merge.

2. Click on File in the menu bar, and choose Merge Datasets.

The picture shows the resulting merged dataset created from several individual data files.

T MergeIndividual.png


The result is a new data node containing the merged data. The original data nodes are still present.

T ProjectFolder IndividMerged.png



Working with remote data sources

The remote Open File dialog

geWorkbench can retrieve data from certain remote data sources; currently only instances of the NCICB's caArray database are supported. The Open File dialog allows remote sources to be added to the list of those available either manually or through discovery using grid services. Entries (locations, parameters) for non-grid services can be edited.

As before, right-click on the project and select Open Files to bring up the Open File dialog. Click the Remote radio button. The Open File dialog window will be expanded to include remote sources.

Note the distinction between the "Open File" button, which opens a local or remote file, and the "Go" button, described below, which connects to a chosen remote resource to allow browsing.

T OpenFile Remote.png


The buttons at the bottom of the remote file dialog are:

Source button - lists remote resources.

Go button - downloads a list of all available experiments from the remote source.

Filter button - shows only particular types of experiments on the remote source, for example by organism or chip type.

Add A New Resource button - Opens the Data Source Definition Page used to add a remote data source.

Edit button - Edits remote source parameters.

Loading data from a remote instance of caArray

geWorkbench can download data from instances of caArray. In the File->Open dialog, click the Remote setting.

You can edit the connection settings to connect to the desired instance of caArray by pushing the Edit button. The configuration for connecting to the production instance of caArray at the NCI is shown here:

T caArray EditConnection.png


If you click on the red Go button next to the caArray data source at the bottom of the dialog, all available caArray experiments at that location will be displayed.

Instead, you can select only particular kinds of experiments by pushing the Filter button. Here we show experiments of type "Human" being selected.


T caArray Filter.png


And here are the resulting entries in the database:

T caArray Human Expts.png
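Conceptually, the filter narrows the full experiment list to the entries that match the chosen criteria. The sketch below illustrates that idea only; it does not call caArray or geWorkbench, and the function name `filter_experiments`, the field names and the experiment records are invented placeholders.

```python
def filter_experiments(experiments, **criteria):
    """Return the experiments whose fields match every given criterion."""
    return [
        exp for exp in experiments
        if all(exp.get(key) == value for key, value in criteria.items())
    ]


# Made-up records standing in for the list returned by a remote source.
experiments = [
    {"name": "Experiment A", "organism": "Human", "chip_type": "Chip X"},
    {"name": "Experiment B", "organism": "Mouse", "chip_type": "Chip Y"},
    {"name": "Experiment C", "organism": "Human", "chip_type": "Chip Y"},
]

human_experiments = filter_experiments(experiments, organism="Human")
human_chip_y = filter_experiments(
    experiments, organism="Human", chip_type="Chip Y"
)
```

Passing no criteria returns the whole list, which corresponds to pressing Go without applying a filter.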


Push the Show Arrays button to see the individual array datasets available for download for this experiment.

T caArray ShowArrays.png


Now we will select four of the arrays and push the Open button to begin the download. Don't forget to click the Merge button first if you want the data merged into a single dataset.

T caArray Open4Arrays.png


You will be prompted to select the quantitation type from those available for the experiment. Here we select CHP Signal:

T caArray QuantType.png

To add a remote source

(Note - currently only caArray data sources are supported).

1. Click on the Add A New Resource button.

(T)MRemoteData2.png

This is the Data Source Definition Page.

2. Fill in the Data Source definition page. URL and Short Name are required fields.

3. Click on the OK button.

The configuration is set up to automatically reflect your additional Data Source.


To modify a remote source

The specification of the remote resource can be edited.

1. Click on the Edit button at the bottom of the Open File dialog.

2. Make the changes that you need.

3. Click on the OK button.

T ProjectPanel EditRemote.png