Difference between revisions of "Local Data Files"

(Supported data formats)
Line 12: Line 12:
 
*Microarray
 
*Microarray
 
**'''Affymetrix MAS5/GCOS''' files - produced by Affymetrix data analysis programs.
 
**'''Affymetrix MAS5/GCOS''' files - produced by Affymetrix data analysis programs.
**'''Affymetrix File Matrix''' - a geWorkbench spreadsheet-type multi-experiment format; this is the native file type created by geWorkbench from merged datasets.  There are two data columns per array; the first contains the signal value, the second contains either a p-value or an Affymetrix Present/Marginal/Absent call.  The header format for this file is complex.
+
**'''Affymetrix File Matrix''' - a geWorkbench spreadsheet-type multi-experiment format; this is the native file type created by geWorkbench from merged datasets.  It can contain the data from any number of arrays in a spreadsheet-like "matrix" format.  It also allows for the grouping of arrays into named subsets based on phenotypic criteria.  Multiple such groups can be defined, each containing a different division of the arrays among named sets.  geWorkbench can create files in this format from data read in from other formats.
 
** '''GEO Soft formats'''  
 
** '''GEO Soft formats'''  
 
*** '''series''' files, e.g. GSE2189_family.soft.
 
*** '''series''' files, e.g. GSE2189_family.soft.
Line 30: Line 30:
 
Several of the expression file formats are described in detail on the [[File_Formats| File Formats]] page.
 
Several of the expression file formats are described in detail on the [[File_Formats| File Formats]] page.
  
==Details of the geWorkbench Affymetrix File Matrix format==
 
 
This file format is proprietary to geWorkbench.  It contains the data from any number of arrays in a spreadsheet-like "matrix" format.  It also allows for the grouping of arrays into named sets based on phenotypic criteria.  Multiple such groups can be defined, each containing a different division of the arrays among named sets.
 
 
geWorkbench can create files in this format from data read in from other formats.
 
 
There are two data columns for each array, signal and confidence.  The confidence value can either be a p-value or an Affymetrix present/marginal/absent (P/M/A) call.
 
The format is difficult to create by hand because the descriptive lines containing array names and phenotype groups contain only a single column per array, whereas there are two data columns per array.  That is, data columns are not directly labeled by their header lines.
 
 
[[Media:Bcell.head.txt| '''Example file:''']] An example of the first 14 lines of the file Bcell-100.exp (used in many examples in these tutorials) is available [[Media:Bcell.head.txt| here]].
 
 
# The file is tab delimited.
 
#The first line begins with the word "AffyID", followed by the word "Annotation" in the second column.  Columns 3 onward contain the array names.
 
#There can be any number of phenotypic groups on the following lines, each beginning with the word "Description", followed by the name of the group in the second column.  Columns 3 onward contain the particular set label for each array.
 
# After the Description lines, if any, the remaining lines contain the data matrix.  The first column contains the marker name (Affy ID).  The second column can contain annotation information.  However, annotations can also be read in from a separate annotation file (Affymetrix CSV format).  The remaining columns contain, as explained above, the signal and confidence values for each array.
 
 
 
The basic format is as follows:
 
AffyID Annotation ArrayName1 ArrayName2 ArrayName3 ArrayName4 etc...
 
Description Set_A_Name subset_A1_Name subset_A1_Name subset_A2_Name subset_A2_Name etc...
 
Description Set_B_Name subset_B1_Name subset_B1_Name subset_B1_Name subset_B2_Name etc...
 
markerID1 "some annotation" expression1-1 confidence1-1 expression1-2 confidence1-2 etc...
 
markerID2 "some annotation" expression2-1 confidence2-1 expression2-2 confidence2-2 etc...
 
etc...
 
 
A second format, identical to the above but with the confidence value columns omitted, is also supported by this file parser.
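The layout described above can be sketched as a small reader. This is only an illustration under stated assumptions, not the geWorkbench parser: the function name read_affy_matrix is hypothetical, the annotation column is skipped, blank lines are ignored, and the confidence value is kept as a string because it may be either a p-value or a P/M/A call.

```python
import io


def read_affy_matrix(handle):
    """Return (array_names, groups, data) from an Affymetrix File Matrix stream.

    Hypothetical helper for illustration; not part of geWorkbench.
    """
    # Assumption: blank lines carry no information and are skipped.
    lines = [ln.rstrip("\r\n").split("\t") for ln in handle if ln.strip()]
    header, body = lines[0], lines[1:]
    if header[:2] != ["AffyID", "Annotation"]:
        raise ValueError("first line must start with AffyID and Annotation")
    arrays = header[2:]
    groups, data = {}, {}
    for fields in body:
        if fields[0] == "Description" and not data:
            # One set label per array, keyed by the group name in column 2.
            groups[fields[1]] = dict(zip(arrays, fields[2:]))
            continue
        marker, values = fields[0], fields[2:]  # column 2 (annotation) is skipped
        if len(values) == 2 * len(arrays):
            # Two columns per array: signal, then confidence.
            pairs = [(float(values[i]), values[i + 1])
                     for i in range(0, len(values), 2)]
        elif len(values) == len(arrays):
            # Variant with the confidence columns omitted.
            pairs = [(float(v), None) for v in values]
        else:
            raise ValueError(f"{marker}: unexpected number of data columns")
        data[marker] = dict(zip(arrays, pairs))
    return arrays, groups, data


sample = (
    "AffyID\tAnnotation\tArrayName1\tArrayName2\n"
    "Description\tSet_A_Name\tsubset_A1_Name\tsubset_A2_Name\n"
    "markerID1\tsome annotation\t120.5\tP\t98.0\tA\n"
)
arrays, groups, data = read_affy_matrix(io.StringIO(sample))
```

The reader pairs each array name with its two data columns explicitly, which is the step that makes the format awkward to build by hand.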
 
  
 
==Details of Tab-delimited data matrix format==
 
==Details of Tab-delimited data matrix format==

Revision as of 17:49, 17 June 2010



Overview

This chapter covers:

  • Data file formats supported by geWorkbench
  • Loading microarray data from local files, while merging the data into one data set.
  • Merging data from several previously loaded microarray experiments.


Supported data formats

  • Microarray
    • Affymetrix MAS5/GCOS files - produced by Affymetrix data analysis programs.
    • Affymetrix File Matrix - a geWorkbench spreadsheet-type multi-experiment format; this is the native file type created by geWorkbench from merged datasets. It can contain the data from any number of arrays in a spreadsheet-like "matrix" format. It also allows for the grouping of arrays into named subsets based on phenotypic criteria. Multiple such groups can be defined, each containing a different division of the arrays among named sets. geWorkbench can create files in this format from data read in from other formats.
    • GEO Soft formats
      • series files, e.g. GSE2189_family.soft.
      • curated data matrix (GDS files), e.g. GDS507_sample.soft.
      • series matrix files, e.g. GSE2189_series_matrix.txt.
    • MAGE-TAB data matrix - this is an auxiliary MAGE-TAB data file type that can be used to contain summarized data from multiple arrays.
    • Tab-delimited text (e.g. files exported by RMAExpress or other programs) - A simple columnar file format.
    • Genepix .GPR files - Produced by a popular analysis program for two-color microarrays.
    • Affymetrix CEL files - these files of probe level data can be viewed graphically in geWorkbench but not used directly for analysis.
  • Other
    • FASTA files - DNA or amino-acid sequence files in FASTA format.
    • PDB files - protein 3-dimensional structure files can be viewed in the JMol Viewer in geWorkbench.
    • NetBoost Edge List - used by a component still under development.

Detailed expression file formats

Several of the expression file formats are described in detail on the File Formats page.


Details of Tab-delimited data matrix format

This simple tab-delimited format contains only the array names, the marker names, and the expression data. It does not contain any annotation, nor does it support any groupings of arrays. Unlike the Affymetrix File Matrix format, it supports only a single value per marker per array; a second, confidence value is not supported.

Leading lines beginning with ! or # are taken to be comments and are discarded. The first line not beginning with ! or # is taken to be the column header line.

The first column contains the marker (e.g. probeset) names. Each name in this column must be unique. The remaining columns contain the data matrix as floating point numbers. The header line contains the array names. The column header for the first column can be any word, here "Probesets".

Probesets	Array Name 1	Array Name 2	Array Name 3	Array Name 4	Array Name 5
markerID1	expression1-1	expression1-2	expression1-3	expression1-4	expression1-5
markerID2	expression2-1	expression2-2	expression2-3	expression2-4	expression2-5
markerID3	expression3-1	expression3-2	expression3-3	expression3-4	expression3-5
markerID4	expression4-1	expression4-2	expression4-3	expression4-4	expression4-5

Some microarray platforms include multiple markers/probesets for some or many of the genes represented. If a data file contains, e.g., gene symbols rather than individual marker names in the first column, any duplicated label will prevent the file from being read into geWorkbench.
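The rules above (leading comment lines, one header line, unique marker names, floating point values) can be sketched as a short reader that is also handy for checking a file before loading it. This is a minimal illustration, not the geWorkbench parser: the function name read_data_matrix is hypothetical, and skipping blank lines is an assumption.

```python
import io


def read_data_matrix(handle):
    """Parse a tab-delimited data matrix into (array_names, {marker: [values]}).

    Hypothetical helper for illustration; not part of geWorkbench.
    """
    header = None
    rows = {}
    for raw in handle:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # assumption: blank lines are ignored
        if header is None:
            # Leading lines starting with ! or # are comments;
            # the first other line is the column header line.
            if line.startswith(("!", "#")):
                continue
            header = line.split("\t")
            continue
        fields = line.split("\t")
        marker, values = fields[0], fields[1:]
        if marker in rows:
            # Duplicate names in the first column are not allowed.
            raise ValueError(f"duplicate marker name: {marker}")
        if len(values) != len(header) - 1:
            raise ValueError(
                f"{marker}: expected {len(header) - 1} values, got {len(values)}"
            )
        rows[marker] = [float(v) for v in values]
    if header is None:
        raise ValueError("no header line found")
    return header[1:], rows


sample = (
    "# exported matrix\n"
    "Probesets\tArray Name 1\tArray Name 2\n"
    "markerID1\t7.5\t8.25\n"
    "markerID2\t6.0\t5.5\n"
)
arrays, data = read_data_matrix(io.StringIO(sample))
```

A file that uses gene symbols in the first column and repeats one of them would raise the duplicate-name error here.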

Microarray data and merging datasets

When working with microarray data, all data to be analyzed must be present within one data node in a project. If the data exists as multiple files containing results from single arrays, the data must be merged into a single node before it can be used. geWorkbench can perform this merging step either at the time data is read in, or later in a separate step. Once merged, such a dataset can be saved to disk; it will be saved in the geWorkbench matrix file format.

Data merging will be covered in the local and remote data tutorials.
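Conceptually, merging turns several single-array result sets into one matrix with a column per array. The sketch below illustrates only that idea and says nothing about how geWorkbench implements it: the function name merge_arrays and the dictionary layout are hypothetical, and the check that all arrays share the same markers is an assumption standing in for the requirement that merged files come from the same chip type.

```python
def merge_arrays(datasets):
    """Merge {array_name: {marker: value}} into (array_names, {marker: [values]}).

    Hypothetical helper for illustration; not part of geWorkbench.
    """
    names = list(datasets)
    if not names:
        raise ValueError("nothing to merge")
    markers = list(datasets[names[0]])
    for name in names[1:]:
        # Assumption: arrays from the same chip type have identical marker sets.
        if set(datasets[name]) != set(markers):
            raise ValueError(
                f"{name} does not have the same markers as {names[0]}"
            )
    # One row per marker, one value per array, in the order the arrays were given.
    merged = {m: [datasets[n][m] for n in names] for m in markers}
    return names, merged


single = {
    "array1": {"markerID1": 7.5, "markerID2": 6.0},
    "array2": {"markerID1": 8.25, "markerID2": 5.5},
}
names, merged = merge_arrays(single)
```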


Tutorial: Loading microarray data files - local

In this example, we will load 10 individual Affymetrix MAS5 format files and merge them into a single dataset. The origin of these files is described in the section Tutorial_-_Data.

Note that no Affymetrix annotation files are included in the geWorkbench distribution. The annotation file for the HG-U95Av2 array used in this example can be obtained from the Affymetrix website. See the instructions in the FAQ entry on this topic.

1. Right-click on the default Workspace entry in the Project Folders component and create a new project.

T ProjectFolders NewProj.png


2. Next, right-click on the new Project entry and select Open Files.

T ProjectFolders OpenFiles.png


A file browser will appear with which you can select the files you wish to open. The default is to browse for local files, that is, files on your own computer. geWorkbench can also access data from caArray databases (the Remote option).

T ProjectFolders OpeningCardio Merge.png


3. Here, we select the file type Affymetrix GCOS/MAS5 as shown.

4. Make sure to check the Merge files checkbox. This will create the merged data node as the files are read in.

5. We will select 10 MAS5 format text files from the directory data/cardiogenomics.med.harvard.edu, which is included within the geWorkbench installation.

6. Click Open.


A message will appear giving information about associating an annotation file with the dataset.


T ProjectFolders Annotations.png


7. A file browser will then open with which you can, if you wish, select an annotation file matching your dataset. This is needed if you intend to use features of geWorkbench such as the Sequence Retriever or GO Terms component (Gene Ontology).


T ProjectFolders ChooseAnnotation.png


A status bar will display as the data is loaded:


T ProjectFolders StatusBar.png


8. The merged data set is listed in the Project folder. The individual arrays are shown below in the Arrays component.


T ProjectFolders CardioLoaded.png

Tutorial: Merging microarray data files after they have already been loaded

If data files are not merged at the time they are read in, they can also be merged later, as long as they are from the same chip type.


1. Select the read-in data files that you want to merge.

2. Click on File in the menu bar, and choose Merge Datasets.

T ProjectFolders MergeDatasets.png


3. The result is a new data node containing the merged data. The original data nodes are still present.

T ProjectFolders MergedData.png