Difference between revisions of "Workspace"

Revision as of 17:20, 5 June 2006


Outline

In this tutorial, you will learn how to:

  • Create a new Project.
  • Load microarray data.
  • Merge data from several loaded microarray experiments.
  • Rename a project and/or project node.
  • Remove a project and/or project node.
  • Save project files that you have created.
  • Load, add, and/or modify remote data.


Supported data formats

  • Microarray
    • Affymetrix MAS5/GCOS files - produced by the Affymetrix data analysis programs.
    • Affymetrix File Matrix - a spreadsheet-type multi-experiment format; this is the native file type created by geWorkbench from merged datasets.
    • Tab-delimited text (RMAExpress file) - A simple columnar file format, produced by the program RMAExpress.
    • Affy Excel or txt data file - formats for single Affymetrix experiments (not supported).
    • Genepix files - Produced by a popular analysis program for two-color microarrays.
  • Other
    • FASTA files. DNA or amino-acid sequence files in FASTA format.
    • Pattern files - sequence motifs produced using the Pattern Discovery component of geWorkbench.
    • Genotypic data files - (not supported).

Data organization

Workspaces and Projects

In the Project Folders component there is a top-level object called a workspace. The workspace can contain one or more separate projects, and each project can contain opened data files and analysis results. The workspace as a whole, with all its projects and data nodes, can be saved and restored. Projects allow data to be grouped, for example by experiment. A project can contain many different types of data, for example microarray data, FASTA sequence files and graphical images.
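The containment described above can be pictured with a small sketch. This is only an illustration of the workspace, project and data node hierarchy; the class names `Workspace`, `Project` and `DataNode` and their fields are hypothetical and are not geWorkbench's actual classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataNode:
    """One opened data file or analysis result (hypothetical type)."""
    name: str
    kind: str  # for example "microarray", "fasta" or "image"


@dataclass
class Project:
    """Groups related data nodes, for example by experiment."""
    name: str
    nodes: List[DataNode] = field(default_factory=list)


@dataclass
class Workspace:
    """Top-level object holding one or more projects."""
    projects: List[Project] = field(default_factory=list)

    def new_project(self, name: str) -> Project:
        # Every data node must live inside a project, so projects are
        # created through the workspace.
        project = Project(name)
        self.projects.append(project)
        return project


workspace = Workspace()
project = workspace.new_project("Cardiogenomics")
project.nodes.append(DataNode("merged arrays", "microarray"))
project.nodes.append(DataNode("promoters", "fasta"))
```

A single project here holds two different kinds of data, which mirrors the point that projects group data by experiment rather than by type.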

Microarray data and merging

A file from disk or from the network is opened within a given project. Creation of a new project is described below. When working with microarray data, all data to be analyzed must be present within one data node in a project. If the data exists as multiple files containing results from single arrays, the data must be merged into a single node before it can be used. geWorkbench can perform this merging step either at the time data is read in, or later in a separate step. Once merged, such a dataset can be saved out to disk; it will be saved in the geWorkbench matrix file format.
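Conceptually, merging turns several single-array result sets into one marker-by-array matrix. The following is a minimal sketch of that idea, not geWorkbench's implementation; the function `merge_arrays`, the in-memory dictionary layout and the expression values are made up for illustration.

```python
def merge_arrays(arrays):
    """Merge single-array results into one marker-by-array matrix.

    `arrays` maps an array name to a dict of marker ID -> value.
    Returns (markers, array_names, rows), where rows[i][j] is the
    value of markers[i] on array_names[j].
    """
    array_names = list(arrays)
    if not array_names:
        raise ValueError("nothing to merge")
    # Marker order is taken from the first array.
    markers = list(arrays[array_names[0]])
    for name in array_names[1:]:
        if set(arrays[name]) != set(markers):
            raise ValueError(f"{name} has a different marker set")
    rows = [[arrays[name][m] for name in array_names] for m in markers]
    return markers, array_names, rows


# Two single-array results with illustrative (made-up) values.
singles = {
    "array_1": {"1000_at": 120.5, "1001_at": 33.0},
    "array_2": {"1000_at": 98.1, "1001_at": 41.7},
}
markers, names, matrix = merge_arrays(singles)
```

Each row of `matrix` then holds one marker across all arrays, which is the spreadsheet-like shape that a multi-experiment matrix file stores.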

Limitations

Only one data node can be selected at a time. If you wish to save a data node to a file, in most cases you must specify a file type extension, such as ".exp" for the geWorkbench merged file matrix format, or ".fasta" for a sequence file. At present, the only type of remote data source that can be opened is NCICB's caArray database.


Creating a new project and loading microarray data files

In this example, we will load 10 individual Affymetrix MAS5 format files, and merge them into a single dataset.

All data must belong to a project. Right-click on the Workspace entry in the Project Folders window at upper left to create a new project.

T NewProject.png


Next, right-click on the New Project entry and select Open Files.

T OpenFiles.png


Here, we will select file type Affymetrix GCOS/MAS5 as shown.

Make sure to check the Merge files checkbox. This will create the merged data node as the files are read in.

We will select 10 MAS5 format text files from the directory geworkbench\data\training\cardiogenomics.med.harvard.edu, which is included in the geWorkbench download.

Click Open.

T OpenFile CardioMerge.png


You may see the message "The chip type HG_U95Av2 is recognized..."

T OpenFile ChipRecog.png


The merged dataset is listed in the Project folder. The data is displayed, in single array format, in the Microarray Viewer. Note that we have increased the intensity slider to maximum here. You can scroll through the arrays from first to last using the slider. The Microarray Viewer displays the markers in the linear order in which they appear in the data file. The display does not correspond in any way to a physical picture or representation of the actual 2-D microarray.

T FullApp MergedData.png



Merging microarray data files after they have already been loaded

If Affymetrix data files are not merged at the time they are read in, they can also be merged later, as long as they are from the same chip type.


1. Select the read-in data files that you want to merge.

2. Click on File in the menu bar, and choose Merge Datasets.

The picture shows the resulting merged dataset created from several individual data files.

T ProjectFolder MergeIndivid.png


The result is a new data node containing the merged data. The original data nodes are still present.
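The two rules in this section (only arrays from the same chip type can be merged, and merging adds a new node without removing the originals) can be sketched as follows. This is an illustration under assumed names: `merge_selected` and the dictionary keys `name`, `chip_type` and `arrays` are hypothetical and do not reflect geWorkbench's internal API.

```python
def merge_selected(project_nodes, selected):
    """Return the project's node list with a merged node appended.

    Each node is a dict with hypothetical keys "name", "chip_type"
    and "arrays". The original nodes are left untouched.
    """
    chip_types = {node["chip_type"] for node in selected}
    if len(chip_types) != 1:
        # Covers both an empty selection and mixed chip types.
        raise ValueError("select one or more nodes of a single chip type")
    merged = {
        "name": "Merged dataset",
        "chip_type": chip_types.pop(),
        "arrays": [a for node in selected for a in node["arrays"]],
    }
    return project_nodes + [merged]


nodes = [
    {"name": "file_1", "chip_type": "HG_U95Av2", "arrays": ["array_1"]},
    {"name": "file_2", "chip_type": "HG_U95Av2", "arrays": ["array_2"]},
]
updated = merge_selected(nodes, nodes)
```

After the call, `updated` holds the two original nodes followed by the new merged node, matching what the Project Folders window shows after Merge Datasets.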

T ProjectFolder IndividMerged.png


Renaming a project or a data node

Renaming a project

1. Right-click on the project folder.

2. Select Rename.


T ProjectFolder RenameProject.png


3. In the pop-up screen rename your project.

4. Click on the OK button.


Renaming a project data node

1. Right-click on a Project Folder data node.

2. Select Rename.

T ProjectFolder RenameDataset.png


3. In the pop-up screen rename your data node.

T ProjectFolder RenameDataset2.png


4. Click on the OK button.


Removing a project or a data node

Removing a project

1. Right-click on the project folder.

2. Select Remove.


Removing a project data node

1. Right-click on the data node.

2. Select Remove.


Saving a data node to a file

Saving a merged dataset is, among other things, how you create the matrix multi-experiment file format used by geWorkbench.

1. Right-click on the data node that you want to save.

2. Click Save.

T ProjectFolder SaveNode.png


A standard file Save screen will come up.

3. Choose a location.

4. Enter a name. Be careful to enter an appropriate file type extension, as this is not added automatically. For example, for the merged multi-experiment matrix file type you should include the extension ".exp" in the filename.

5. Click on the Save button.
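Because the extension is not added for you, it can help to think of the filename check as a small rule. The sketch below shows that rule only; the helper `with_extension` is hypothetical and is not part of geWorkbench.

```python
def with_extension(filename, extension):
    """Return filename, appending extension unless it is already there."""
    if filename.lower().endswith(extension.lower()):
        return filename
    return filename + extension


merged_name = with_extension("cardio_merged", ".exp")
sequence_name = with_extension("promoters.fasta", ".fasta")
```

Here `merged_name` becomes "cardio_merged.exp", while `sequence_name` is returned unchanged because it already ends in ".fasta".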


Working with remote data sources

The remote Open File dialog

geWorkbench can retrieve data from certain remote data sources; currently only instances of the NCICB's caArray database are supported. The Open File dialog allows remote sources to be added to the list of those available either manually or through discovery using grid services. Entries (locations, parameters) for non-grid services can be edited.

As before, right-click on the project and select Open Files to bring up the Open File dialog. Click the Remote radio button. The Open File dialog window will be expanded to include remote sources.

Note the distinction between the "Open File" button, which opens a local or remote file, and the "Go" button, described below, which connects to a chosen remote resource to allow browsing.

(T)MEditRemoteData.png

Four additional buttons appear. They are:

caArray button - lists remote resources.

Go button - connects to the selected remote source.

Add A New Resource button - Opens the Data Source Definition Page used to add a remote data source.

Edit button - Edits remote source parameters.


Loading data from a remote instance of caArray

Click on the Go button next to the caArray data source at the bottom of the dialog. All available caArray experiments at that location will be displayed.

T ProjectFolder caArrayExpts.png

Note that the type of experiment data provided here in caArray is of type "derived bioassay". This is data that has been processed from raw data, for example using RMA.

Select an experiment that has derived bioassays. Here we depict the experiment ending in *99049. The number of derived bioassays, 12, is displayed, along with the experiment information. (A new dataset, "Public Rembrandt", has subsequently been added, which would also be good to use for experimenting with caArray data download. It has 53 bioassays available.)

To retrieve the bioassays themselves, right-click on the experiment and press Get bioassays. This will download the list of available bioassays into geWorkbench.

T ProjectFolder GetRemoteBioassays.png


To actually retrieve bioassay data, select the desired arrays and push the Open button. (Although below we show retrieving multiple array datasets, for demonstration purposes you might want to first select just one, as each can take several minutes to download.)

You can either select the merge option here, or wait until all data has been successfully downloaded and perform a merge later.

T ProjectFolder OpenRemoteBioassays.png

To add a remote source

(Note - currently only caArray data sources are supported).

1. Click on the Add A New Resource button.

(T)MRemoteData2.png This is the Data Source Definition Page

2. Fill in the Data Source definition page. URL and Short Name are required fields.

3. Click on the OK button.

The configuration is updated automatically to include the added data source.


To modify a remote source

The specification of the remote resource can be edited.

1. Click on the Edit button at the bottom of the Open File dialog.

2. Make the changes that you need.

3. Click on the OK button.

T ProjectPanel EditRemote.png