Difference between revisions of "Local Data Files"

Revision as of 16:50, 11 March 2011



Overview

This chapter covers:

  • Data file formats supported by geWorkbench
  • Loading microarray data from local files and merging it into one data set as it is read in.
  • Merging data from several previously loaded microarray experiments.


Open File Dialog

The Open File dialog can be reached in two ways:

  • Right-click on a Project and select "Open file(s)".
  • From the top-level menu bar, select File->Open->File.

File Open Selector.png

File Browser Controls

The file browser offers the following standard options:

  • Select Directory - A pulldown of directories available at the current directory level. The contents of the directory are listed below when a directory is selected.
  • Up One Level - Move up one directory level.
  • Desktop - Go to the user's desktop or home directory.
  • Create New Folder - Create a new directory in the current one.
  • List - Display only the names of files and directories contained in the current directory.
  • Details - Display name, size, type and date of items in the directory.

File Open Selector Controls.png


The selector lists the file types that geWorkbench can load. Once a file type is chosen, only files matching the indicated file extension (e.g. *.exp, *.fasta) are displayed in the file list.
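
The filtering described above can be sketched in a few lines of Python. This is only an illustration and not geWorkbench's own code: the `FILE_TYPE_PATTERNS` table and the `visible_files` helper are hypothetical, and only the patterns `*.exp` and `*.fasta` come from the text above. The labels attached to them are assumptions.

```python
from fnmatch import fnmatchcase

# Hypothetical table of file types and glob patterns. Only the patterns
# *.exp and *.fasta are mentioned in this chapter; the labels are assumed.
FILE_TYPE_PATTERNS = {
    "Expression matrix (assumed label)": "*.exp",
    "FASTA": "*.fasta",
}


def visible_files(filenames, file_type):
    """Return only the names that match the pattern of the chosen file type."""
    pattern = FILE_TYPE_PATTERNS[file_type]
    return [name for name in filenames if fnmatchcase(name, pattern)]


# Made-up directory listing, for illustration only.
listing = ["experiment1.exp", "notes.txt", "sequences.fasta"]
print(visible_files(listing, "FASTA"))
```

Whether the real dialog matches extensions case-sensitively is not stated in this chapter; the sketch uses a case-sensitive match purely for simplicity.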

Supported data formats

  • Adjacency Matrix - An interaction network such as generated using ARACNe.
  • Affy .CEL files - Affymetrix CEL files (probe level data) can be viewed graphically in geWorkbench but not used directly for analysis.
  • Affymetrix File Matrix - A geWorkbench spreadsheet-type multi-experiment format; this is the native file type created by geWorkbench from merged datasets. It can contain the data from any number of arrays in a spreadsheet-like "matrix" format. It also allows arrays to be grouped into named subsets based on phenotypic criteria. Multiple such groups can be defined, each containing a different division of the arrays among named sets. geWorkbench can create files in this format from data read in from other formats.
  • Affymetrix MAS5/GCOS files - produced by Affymetrix data analysis programs.
  • FASTA files - DNA or amino-acid sequence files in FASTA format.
  • Genepix .GPR files - Produced by a popular analysis program for two-color microarrays.
  • GEO Soft and GEO Series formats
    • GSM - individual sample files.
    • GSE - series files representing an entire experiment, e.g. GSE2189_family.soft.
    • GDS - curated data matrix, e.g. GDS507_sample.soft.
    • Series matrix - matrix format of submitted, uncurated data, e.g. GSE2189_series_matrix.txt.
  • MAGE-TAB data matrix - this is an auxiliary MAGE-TAB data file type that can be used to contain summarized data from multiple arrays.
  • PDB Structure File - Molecular 3-D structure in Protein Data Bank format; can be viewed in the JMol Viewer in geWorkbench.
  • Tab-delimited text (e.g. files exported by RMAExpress or other programs) - A simple columnar file format.


Merge Files

The Merge Files checkbox can be used if multiple files are being read in at one time, each representing a single array. An example with individual MAS5 format files is illustrated below.

Detailed descriptions of expression file formats

See File Formats for detailed descriptions of some of the file formats used.


Microarray Data Annotation files

If a microarray data file is being loaded, the dialog will give the user the option of also loading an annotation file. These files associate the individual markers, e.g. probesets, with information such as gene name, Entrez ID, GO Terms etc.

Currently, geWorkbench accepts annotation files only in the Affymetrix CSV format. See File_Formats#Annotation_Files for further details.



Microarray data and merging datasets

When working with microarray data, all data to be analyzed must be present within one data node in a project. If the data exists as multiple files containing results from single arrays, the data must be merged into a single node before it can be used. geWorkbench can perform this merging step either at the time data is read in, or later in a separate step. Once merged, such a dataset can be saved to disk; it will be saved in the geWorkbench matrix file format.
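
The idea of merging can be illustrated with a minimal Python sketch. It is not geWorkbench's implementation and does not reproduce the geWorkbench matrix file format: the `merge_arrays` function, the array names, the marker IDs and the values are all hypothetical. It assumes that each single-array file has already been parsed into a mapping from marker ID to value.

```python
def merge_arrays(arrays):
    """Merge single-array data into one marker-by-array matrix.

    `arrays` maps an array name to a dict of {marker ID: value}.
    Returns (array_names, rows), where each row is
    [marker ID, value in first array, value in second array, ...].
    """
    array_names = list(arrays)
    if not array_names:
        raise ValueError("no arrays to merge")
    markers = list(arrays[array_names[0]])
    # Assumption: every array must report exactly the same markers. This
    # stands in for the requirement that merged arrays share a chip type.
    for name in array_names[1:]:
        if set(arrays[name]) != set(markers):
            raise ValueError(f"{name} has a different marker set")
    rows = [[m] + [arrays[name][m] for name in array_names] for m in markers]
    return array_names, rows


# Made-up marker IDs and values, for illustration only.
single_arrays = {
    "array_A": {"marker_1": 120.5, "marker_2": 33.0},
    "array_B": {"marker_1": 98.1, "marker_2": 41.7},
}
names, matrix = merge_arrays(single_arrays)
print(names)
for row in matrix:
    print(row)
```

Each row of the result holds one marker and each column one array, which is the spreadsheet-like layout that a single merged data node needs.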

Tutorial: Loading microarray data files - local

In this example, we will load 10 individual Affymetrix MAS5 format files and merge them into a single dataset. The origin of these files is described in the section Tutorial_-_Data.

Note that no Affymetrix annotation files are included in the geWorkbench distribution. The annotation file for the HG-U95Av2 array used in this example can be obtained from the Affymetrix website. See the instructions in the FAQ entry on this topic.

1. Right-click on the default Workspace entry in the Project Folders component and create a new project.

T ProjectFolders NewProj.png


2. Next, right-click on the new Project entry and select Open Files.

T ProjectFolders OpenFiles.png


A file browser will appear, with which you can select the files you wish to open. The default is to browse for local files, that is, files on your own computer. geWorkbench can also access data from caArray databases (the Remote option).

T ProjectFolders OpeningCardio Merge.png


3. Here, we select the file type Affymetrix GCOS/MAS5 as shown.

4. Make sure to check the Merge files checkbox. This will create the merged data node as the files are read in.

5. We will select 10 MAS5 format text files from the directory data/cardiogenomics.med.harvard.edu, which is included within the geWorkbench installation.

6. Click Open.


A message will appear giving information about associating an annotation file with the dataset.


T ProjectFolders Annotations.png


7. A file browser will then open with which you can, if you wish, select an annotation file matching your dataset. This is needed if you intend to use features of geWorkbench such as the Sequence Retriever or GO Terms component (Gene Ontology).


T ProjectFolders ChooseAnnotation.png


A status bar will display as the data is loaded:


T ProjectFolders StatusBar.png


8. The merged data set is listed in the Project folder. The individual arrays are shown below in the Arrays component.


T ProjectFolders CardioLoaded.png


Note - if you have created a custom Affymetrix-format annotation file which has more than one entry for a probeset, an error dialog will be shown.

Annotation Parser Handle Duplicates.png

See File_Formats#Affymetrix_Annotation_Files for further details.
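
A custom annotation file can be checked for duplicate probeset entries before loading it. The following Python sketch is one way to do that and is not part of geWorkbench. The `duplicate_probesets` helper is hypothetical, and it assumes that the probeset identifier is held in a column named "Probe Set ID"; adjust the column name to match your file.

```python
import csv
import io
from collections import Counter


def duplicate_probesets(csv_text, id_column="Probe Set ID"):
    """Return probeset IDs that appear on more than one row, sorted."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[id_column] for row in reader)
    return sorted(pid for pid, n in counts.items() if n > 1)


# Made-up annotation rows; real files have many more columns and may
# begin with comment lines that would need to be skipped first.
sample = (
    "Probe Set ID,Gene Symbol\n"
    "probe_1,GENE_A\n"
    "probe_2,GENE_B\n"
    "probe_1,GENE_C\n"
)
print(duplicate_probesets(sample))
```

An empty result means that no probeset is listed more than once under the assumed column name.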

Tutorial: Merging microarray data files after they have already been loaded

If data files are not merged at the time they are read in, they can also be merged later, as long as they are from the same chip type.


1. Select the read-in data files that you want to merge.

2. Click on File in the menu bar, and choose Merge Datasets.

T ProjectFolders MergeDatasets.png


3. The result is a new data node containing the merged data. The original data nodes are still present.

T ProjectFolders MergedData.png