Local Data Files

Revision as of 17:28, 1 June 2011 by Smith (Prerequisites)


Overview

This chapter covers:

  • Data file formats supported by geWorkbench
  • Loading microarray data from local files, optionally merging the data into one dataset.
  • Merging data from several previously loaded microarray experiments.


Open File Dialog

The Open File dialog can be reached in two ways:

  • Right-click on a Project and select "Open file(s)".
  • From the top-level menu bar, select File->Open->File.

File Open Selector.png

File Browser Controls

The file browser offers the following standard options:

  • Select Directory - A pulldown of directories available at the current directory level. The contents of the directory are listed below when a directory is selected.
  • Up One Level - Move up one directory level.
  • Desktop - Go to the user's desktop or home directory.
  • Create New Folder - Create a new directory in the current one.
  • List - Display only the names of files and directories contained in the current directory.
  • Details - Display name, size, type and date of items in the directory.

File Open Selector Controls.png


Other File Open Controls

The selector lists the available file types that geWorkbench can load. Once a file type is chosen, only files matching the indicated file extension (e.g. *.exp, *.fasta) will be displayed in the available file list.

  • File Name - The name of the selected file is shown in this field.
  • Files of Type - This pulldown menu lists all available file types that geWorkbench can read in. Available types are listed below. Each item is also followed by its file extension(s). Only files matching the appropriate extension will be displayed in the file list box.
  • Open - Open the selected file.
  • Cancel - Quit the File Open dialog without opening a file.
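
The extension filtering described above can be illustrated with a short Python sketch. This is not geWorkbench code. The file-type labels, the mapping table and the file names are hypothetical, and only the *.exp and *.fasta patterns come from the text.

```python
from fnmatch import fnmatch

# Hypothetical mapping of file-type labels to extension patterns.
# Only "*.exp" and "*.fasta" are mentioned in the text; the labels are illustrative.
FILE_TYPE_PATTERNS = {
    "Expression matrix": ["*.exp"],
    "FASTA": ["*.fasta"],
}


def visible_files(file_names, file_type):
    """Return the names that match any extension pattern of the chosen type."""
    patterns = FILE_TYPE_PATTERNS[file_type]
    return [
        name
        for name in file_names
        # Lower-case the name so that "seqs.FASTA" also matches "*.fasta".
        if any(fnmatch(name.lower(), pattern) for pattern in patterns)
    ]


# Hypothetical directory listing.
listing = ["webmatrix.exp", "notes.txt", "seqs.FASTA", "structure.pdb"]
print(visible_files(listing, "FASTA"))
```

Choosing a different entry in the table changes which names are returned, which mirrors how the file list changes when another entry is chosen under Files of Type.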


Merge Files

The Merge Files checkbox can be used if multiple microarray gene expression files are being read in at one time, each representing a single array.

The merge option is only enabled for the various microarray file types. It is disabled for other file types such as PDB and FASTA.

When working with microarray data in geWorkbench, all data to be analyzed must be present within one data node in a project. If the data exists as multiple files each containing results from single arrays, the data must be merged into a single node before it can be used. geWorkbench can perform this merging step either at the time data is read in, or later in a separate step. Once merged, such a dataset can be saved to disk; it will be saved in the geWorkbench matrix file format.
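
Conceptually, merging turns several single-array result sets into one marker-by-array matrix. The Python sketch below illustrates the idea only; it is not how geWorkbench implements merging. The in-memory layout (a dictionary of marker values per array), the function name and the sample values are assumptions made for illustration.

```python
def merge_arrays(arrays):
    """Merge single-array results into one marker-by-array matrix.

    `arrays` maps an array name to a dict of {marker: value} (assumed layout).
    Returns (markers, array_names, rows), where rows[i][j] is the value of
    markers[i] on array_names[j].
    """
    array_names = list(arrays)
    if not array_names:
        raise ValueError("no arrays to merge")
    markers = sorted(arrays[array_names[0]])
    # Assumption: arrays can only be merged if they share one marker set,
    # i.e. they come from the same chip type.
    for name in array_names[1:]:
        if set(arrays[name]) != set(markers):
            raise ValueError(f"array {name!r} has a different marker set")
    rows = [[arrays[name][marker] for name in array_names] for marker in markers]
    return markers, array_names, rows


# Hypothetical values for two arrays measured on the same two markers.
example = {
    "array_1": {"marker_a": 10.5, "marker_b": 200.0},
    "array_2": {"marker_b": 180.0, "marker_a": 12.0},
}
markers, names, rows = merge_arrays(example)
for marker, row in zip(markers, rows):
    print(marker, row)
```

Each row of the result holds one marker and each column one array, which is the spreadsheet-like shape that the matrix file format is described as having.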

An example with individual MAS5 format files is illustrated below.

Local File vs Remote

These two radio buttons control whether the Open dialog is for local data or remote data. This chapter is concerned with opening local data files from disk. See the chapter Remote Data Sources for information on loading remote data from caArray.


Supported data formats

File Open Selector-types.png


  • Adjacency Matrix - An interaction network such as generated using ARACNe.
  • Affy .CEL files - Affymetrix CEL files (probe level data) can be viewed graphically in geWorkbench but not used directly for analysis.
  • Affymetrix File Matrix - a geWorkbench spreadsheet-type multi-experiment format; this is the native file type created by geWorkbench from merged datasets. It can contain the data from any number of arrays in a spreadsheet-like "matrix" format. It also allows for the grouping of arrays into named subsets based on phenotypic criteria. Multiple such groups can be defined, each containing a different division of the arrays among named sets. geWorkbench can create files in this format from data read in from other formats.
  • Affymetrix MAS5/GCOS files - produced by Affymetrix data analysis programs.
  • FASTA files - DNA or amino-acid sequence files in FASTA format.
  • Genepix .GPR files - Produced by a popular analysis program for two-color microarrays.
  • GEO Soft and GEO Series formats
    • GSM - individual sample files.
    • GSE - series files representing an entire experiment, e.g. GSE2189_family.soft.
    • GDS - curated data matrix, e.g. GDS507_sample.soft.
    • Series matrix - matrix format of submitted, uncurated data, e.g. GSE2189_series_matrix.txt.
  • MAGE-TAB data matrix - this is an auxiliary MAGE-TAB data file type that can be used to contain summarized data from multiple arrays.
  • PDB Structure File - Molecular 3-D structure in Protein Data Bank format; can be viewed in the Jmol viewer in geWorkbench.
  • Tab-delimited text (e.g. files exported by RMAExpress or other programs) - A simple columnar file format.

Detailed descriptions of expression file formats

See File Formats for detailed descriptions of some of the file formats used.

GEO Soft files

Multiple Platforms per GEO Soft file

A GEO series file can contain samples from more than one platform (chip type). If geWorkbench detects that more than one platform is represented in the file, it will pop up a dialog box asking the user which platform to load data for. Data from only a single platform type can be loaded into any given microarray data node in geWorkbench.


GEO Soft platform chooser.png


Multiple platform format example

The series file GSE6532_family.soft contains data obtained on three different Affymetrix platforms:

!Series_platform_id = GPL96
!Series_platform_id = GPL97
!Series_platform_id = GPL570

The data for individual samples is found in sections beginning with a SAMPLE tag, e.g.:

^SAMPLE = GSM65316

The platform used for the individual sample is given in the sample header by:

!Sample_platform_id = GPL96

geWorkbench will skip samples that are not from the selected platform.
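
The tags shown above are enough to work out which samples belong to which platform. The following Python sketch scans the lines of a series file and groups sample identifiers by platform. It is an illustration under stated assumptions, not the parser used by geWorkbench: the function name is invented, and the second sample identifier in the excerpt is a placeholder rather than a real GEO accession.

```python
def samples_by_platform(soft_lines):
    """Map each platform ID to the IDs of the samples measured on it."""
    platforms = {}
    current_sample = None
    for line in soft_lines:
        line = line.strip()
        if line.startswith("^SAMPLE"):
            # A new sample section begins, e.g. "^SAMPLE = GSM65316".
            current_sample = line.partition("=")[2].strip()
        elif line.startswith("^"):
            # Some other section begins; we are no longer inside a sample.
            current_sample = None
        elif line.startswith("!Sample_platform_id") and current_sample:
            platform = line.partition("=")[2].strip()
            platforms.setdefault(platform, []).append(current_sample)
    return platforms


# Shortened excerpt; "GSM_HYPOTHETICAL" is a placeholder sample ID.
SOFT_EXCERPT = """\
^SERIES = GSE6532
!Series_platform_id = GPL96
!Series_platform_id = GPL97
^SAMPLE = GSM65316
!Sample_platform_id = GPL96
^SAMPLE = GSM_HYPOTHETICAL
!Sample_platform_id = GPL97
"""

platforms = samples_by_platform(SOFT_EXCERPT.splitlines())
selected = platforms.get("GPL96", [])  # samples kept if GPL96 is chosen
print(sorted(platforms), selected)
```

If more than one key is present in the returned dictionary, the file contains several platforms, which is the situation in which the platform chooser is shown.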


Multiple data types in one Soft file

A GEO series file may contain not only individual sample results, but also a sample table summarizing the expression from all arrays. An example is GSE15139_family.soft. In such a case, where the file has been detected to be a series file (that is, it contains multiple individual sample entries), the sample table will be ignored. The individual sample entries may contain additional data beyond that available in the sample table (e.g. p-values, descriptive sample titles).

Array Names from GEO files

For most microarray dataset types, the array name will either be assigned in the file (e.g. Affy File Matrix) or will be taken from the name of the file (e.g. individual MAS5/GCOS files). However, for GEO Soft format series and series matrix files, we create array names by concatenating the GEO Sample identifier for the particular array with the corresponding sample title. For example, in the file GSE2189_family.soft, sample GSM39801 has the following title line:

!Sample_title = 4_Hr_+MGd_1

The array name in geWorkbench is then GSM39801:4_Hr_+MGd_1.
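
The naming rule can be written out as a small Python sketch. The function names are hypothetical and the code only illustrates the concatenation described above; it is not taken from geWorkbench.

```python
def title_from_line(line):
    """Extract the title from a line such as '!Sample_title = 4_Hr_+MGd_1'."""
    key, _, value = line.partition("=")
    if key.strip() != "!Sample_title":
        raise ValueError(f"not a sample title line: {line!r}")
    return value.strip()


def geo_array_name(sample_id, sample_title):
    """Join the GEO sample identifier and the sample title with a colon."""
    return f"{sample_id}:{sample_title}"


title = title_from_line("!Sample_title = 4_Hr_+MGd_1")
print(geo_array_name("GSM39801", title))  # prints GSM39801:4_Hr_+MGd_1
```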

The figure below shows the sample titles appended to the GEO sample identifiers in the Arrays component (5th line):


GEO Soft array names.png


GEO Platform Annotations Not Read

GEO Soft files can contain marker annotation entries. However, there is not a fixed format for all platforms, so geWorkbench does not attempt to parse this information.

Network Interaction Files

Overview

geWorkbench can read in files that represent a network of interactions between molecules. Some typical interaction types are protein-DNA, protein-protein, and miRNA-mRNA. The molecules are referred to as nodes in a network graph, and the connections as edges.

geWorkbench can currently read in the adjacency matrix format generated by ARACNe and the CNKB component.

The adjacency matrix file format is described in the File Formats section. The adjacency matrix includes a numeric value for each edge, which can be used to represent e.g. some measure of strength for the interaction or the confidence that it is real.

At present, a network file must be loaded as the child of a microarray dataset.

Adjacency matrices are loaded by first selecting a Project node, and then using either the menu-bar or right-click File->Open dialog. A microarray dataset chooser then allows the user to select the dataset to which the network will be added.

Loading an Adjacency Matrix File

Prerequisites

  • A microarray dataset compatible with the network must be loaded first.
  • The adjacency matrix can represent the nodes as either markers (probesets) or as gene symbols. Only one or the other convention can be used in one file.
    • If marker names are used, they must match the marker names used in the microarray dataset.
    • Similarly, if gene symbols are used, an annotation file must have been loaded together with the microarray dataset, and the annotation file and network should use the same gene symbol convention.
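
Before loading a network, it can be useful to check that its node names actually occur in the target dataset. The Python sketch below shows such a check. It assumes the node names have already been extracted from the adjacency matrix file (the file format itself is described in File Formats and is not parsed here), and all names used are hypothetical placeholders.

```python
def unmatched_nodes(network_nodes, dataset_names):
    """Return the network node names that are absent from the dataset.

    `dataset_names` holds either the marker names of the microarray dataset
    or the gene symbols from its annotation file, depending on which
    convention the network file uses.
    """
    known = set(dataset_names)
    return sorted(set(network_nodes) - known)


# Hypothetical marker names; "probe_X" does not occur in the dataset.
dataset_markers = ["probe_A", "probe_B", "probe_C"]
network_nodes = ["probe_A", "probe_C", "probe_X"]
print(unmatched_nodes(network_nodes, dataset_markers))
```

An empty result means every node has a counterpart in the dataset. A long list suggests that the network and the dataset use different naming conventions.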

Sample adjacency network files:

Example

Two microarray datasets have already been loaded.

Network Project Folder.png


  • Right-click on the project and select "Open file(s)". The Open File dialog will appear.
  • Select the file type "Adjacency Matrix". (Only files with a suffix ".txt" are shown).
  • Browse to and select the desired network file, and click "Open".


Network Open File.png


  • After the network has been chosen, a second dialog with a menu listing all available microarray datasets will appear.


Network Select Dataset.png


  • Choose the dataset to which to add the network. Here we use the Bcell-100 network.


Network Select Dataset2.png


The network is shown below loaded as a child of the chosen dataset in the Project Folders component.


Network Loaded.png

Network Size Limitation for Viewing

If the Cytoscape component has been loaded (see the CCM), the network will appear there. However, be aware that there are limits as to how large a network can be loaded into Cytoscape.

Microarray Data Annotation Files

If a microarray data file is being loaded, the dialog will give the user the option of also loading an annotation file. These files associate the individual markers, e.g. probesets, with information such as gene name, Entrez ID, GO Terms etc.

Currently geWorkbench only accepts Affymetrix format annotation files, in CSV format. See File Formats for further details.


File Open Annotation Information Popup.png


The file browser is used to locate an annotation file.


File Open Annotation File Browser.png

Example: Loading microarray data files - local

In this example, we will load 10 individual Affymetrix MAS5 format files, and merge them into a single dataset. The origin of these files is described in the section Tutorial Data.

Note that no Affymetrix annotation files are included in the geWorkbench distribution. The annotation file for the HG-U95Av2 array used in this example can be obtained from the Affymetrix website. See the instructions in the FAQ entry on this topic.

1. Right-click on the default Workspace entry in the Project Folders component and create a new project.

T ProjectFolders NewProj.png


2. Next, right-click on the new Project entry and select Open Files.

T ProjectFolders OpenFiles.png


A file browser will appear with which you can select the files you wish to open. The default is to browse for local files, that is, files on your own computer. geWorkbench can also access data from caArray databases (the Remote option).

T ProjectFolders OpeningCardio Merge.png


3. Here, we select the file type Affymetrix GCOS/MAS5 as shown.

4. Make sure to check the Merge files checkbox. This will create the merged data node as the files are read in.

5. We will select 10 MAS5 format text files from the directory data/cardiogenomics.med.harvard.edu, which is included within the geWorkbench installation.

6. Click Open.


A message will appear giving information about associating an annotation file with the dataset.


File Open Annotation Information Popup.png


7. A file browser will then open with which you can, if you wish, select an annotation file matching your dataset. This is needed if you intend to use features of geWorkbench such as the Sequence Retriever or GO Terms component (Gene Ontology).


T ProjectFolders ChooseAnnotation.png


A status bar will display as the data is loaded:


T ProjectFolders StatusBar.png


8. The merged data set is listed in the Project folder. The individual arrays are shown below in the Arrays component.


T ProjectFolders CardioLoaded.png


Note - if you have created a custom Affymetrix-format annotation file which has more than one entry for a probeset, an error dialog will be shown.

Annotation Parser Handle Duplicates.png

See the Affymetrix Annotation Files section of File Formats for further details.
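
A custom annotation file can be checked for duplicate probeset entries before it is loaded. The Python sketch below is one possible way to do this and is not part of geWorkbench. It assumes that the probeset identifier is stored in a column named "Probe Set ID" and that lines starting with "#" are comments; both should be verified against the actual file. The sample rows are placeholders.

```python
import csv
from collections import Counter


def duplicate_probesets(csv_text, id_column="Probe Set ID"):
    """Return the probeset IDs that occur on more than one row."""
    # Assumption: lines starting with '#' are comments and can be skipped.
    lines = [line for line in csv_text.splitlines() if not line.startswith("#")]
    reader = csv.DictReader(lines)
    counts = Counter(row[id_column] for row in reader)
    return sorted(pid for pid, count in counts.items() if count > 1)


# Placeholder annotation content with one repeated probeset.
SAMPLE_CSV = """\
# hypothetical header comment
"Probe Set ID","Gene Symbol"
"probe_A","GENE1"
"probe_B","GENE2"
"probe_A","GENE1"
"""

print(duplicate_probesets(SAMPLE_CSV))
```

To check a real file, its contents can be read into a string and passed to the function in place of SAMPLE_CSV.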

Example: Merging microarray data files after they have already been loaded

If data files are not merged at the time they are read in, they can also be merged later, as long as they are from the same chip type.


1. Select the read-in data files that you want to merge.

2. Click on File in the menu bar, and choose Merge Datasets.

T ProjectFolders MergeDatasets.png


3. The result is a new data node containing the merged data. The original data nodes are still present.

T ProjectFolders MergedData.png