File Formats

Revision as of 15:50, 28 June 2011 by Smith (talk | contribs) (Pattern Discovery Export Format)


Affymetrix MAS5/GCOS Format

These are text files (specifically, text versions of .CHP files) produced by the MAS software from Affymetrix. The image below provides an example of an input file that the application will accept as a correctly formatted "Affymetrix MAS5/GCOS" file type (only the first 18 lines of the file are shown):


AffyMas5GcosFormat.png


Any number of lines can precede the actual array data (in the example above the first 11 lines are non-data lines). All such lines will be ignored when the file is parsed. The beginning of the actual data is marked by a row containing tab-separated column names. Any number of columns may be present; however, only the following columns will be acted upon (column names must be spelled exactly as listed below and cannot contain tabs):

  • Probe Set Name: This column is mandatory; the file will fail to be parsed unless this column is present. Its contents are strings that provide the marker names associated with a microarray set (AFFX-BioB-5_at, AFFX-MurIL10_at, etc).
  • Signal Log Ratio: This column is optional. If present, its contents must be real numbers and will be used as the expression measurements for the corresponding markers.
  • Signal: Same as column "Signal Log Ratio" above. If both columns are present, only "Signal Log Ratio" is used ("Signal" will be ignored).
  • Avg Diff: Same as columns "Signal Log Ratio" and "Signal". If either "Signal Log Ratio" or "Signal" is present then "Avg Diff" is ignored and the contents of those columns are used instead (with "Signal Log Ratio" taking priority over "Signal"). At least one of these 3 columns ("Signal Log Ratio", "Signal", "Avg Diff") must be present for the file to be parsed in a meaningful way. If this is not the case, the file will still be read in but all markers will be tagged as having missing values.
  • Detection: This column is optional. Its value is a single character (P, M, or A) and if present it is used to determine the absolute call for a measurement (A -> absent, M -> marginal, P -> present).
  • Abs Call: Exactly the same as column "Detection" (for compatibility with text files produced by older versions of the MAS software).
  • Detection p-value: This column is optional. Its values are real numbers between 0 and 1. It provides a measure of confidence in the quality of the measurement reading (the smaller the p-value, the higher the confidence). Further, if both "Detection" and "Abs Call" columns are missing, the "Detection p-value" is used as an alternative method to determine the absolute calls for marker measurements:
    • If p-value < 0.04, then the absolute call is "present".
    • If 0.04 <= p-value < 0.06, then the absolute call is "marginal".
    • If p-value >= 0.06, then the absolute call is "absent".

The p-value thresholds used above are chosen to be the same as the default thresholds used for the same purpose by the MAS software (see the manual "GeneChip Expression Analysis: Data Analysis Fundamentals" from Affymetrix, http://www.affymetrix.com/support/downloads/manuals/data_analysis_fundamentals_manual.pdf).
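
The threshold rule above can be sketched in a few lines of Python. This is only an illustration, not the geWorkbench parser code: the function name call_from_p_value and its parameter names are hypothetical, and the label for the third case is assumed to be "absent", matching the A/M/P calls of the "Detection" column.

```python
def call_from_p_value(p_value, present_below=0.04, marginal_below=0.06):
    """Map a detection p-value to an absolute call (hypothetical helper)."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError(f"p-value out of range: {p_value}")
    if p_value < present_below:
        return "present"
    if p_value < marginal_below:
        return "marginal"
    # Assumption: the remaining case corresponds to the "A" (absent) call.
    return "absent"


for p in (0.01, 0.04, 0.05, 0.06):
    print(p, call_from_p_value(p))
```

Note that both boundaries are half-open: a p-value of exactly 0.04 is already "marginal", and exactly 0.06 falls into the last category.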

The marker data follow the column names. Each row corresponds to one marker. Values within a row are tab-separated.
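
As a rough illustration of the column rules described in this section, the following Python sketch locates the column-name row and picks the expression column by priority ("Signal Log Ratio", then "Signal", then "Avg Diff"). It is not the geWorkbench implementation. The function name read_mas5 is hypothetical, the header row is assumed to be the first line containing a "Probe Set Name" field, treating a non-numeric entry as missing is an assumption, and the sample data are invented.

```python
import io

# Priority order of the expression columns, as described in the text.
VALUE_COLUMNS = ["Signal Log Ratio", "Signal", "Avg Diff"]


def read_mas5(handle):
    """Return {marker name: expression value, or None if missing}."""
    header = None
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        # Assumption: the column-name row is the first row with this field.
        if "Probe Set Name" in fields:
            header = fields
            break
    if header is None:
        raise ValueError('mandatory column "Probe Set Name" not found')

    name_idx = header.index("Probe Set Name")
    value_idx = next((header.index(c) for c in VALUE_COLUMNS if c in header), None)

    values = {}
    for line in handle:  # continues with the rows after the header
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) <= name_idx or not fields[name_idx]:
            continue
        value = None
        if value_idx is not None and value_idx < len(fields):
            try:
                value = float(fields[value_idx])
            except ValueError:
                value = None  # assumption: non-numeric entries count as missing
        values[fields[name_idx]] = value
    return values


sample = (
    "Some preamble line\n"
    "Another non-data line\n"
    "Probe Set Name\tSignal\tDetection\tSignal Log Ratio\n"
    "AFFX-BioB-5_at\t120.5\tP\t1.25\n"
    "AFFX-MurIL10_at\t15.0\tA\tn/a\n"
)
values = read_mas5(io.StringIO(sample))
print(values)
```

Because both "Signal" and "Signal Log Ratio" are present in the sample, only the "Signal Log Ratio" values are used.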


Affymetrix File Matrix Format (geWorkbench)

This is an example of an input file that the application will accept as a correctly formatted "Affymetrix File Matrix" file type (only the first 8 lines of the file are shown):


AffyFileMatrix.png


Files in this format can contain data for more than one microarray (in the example above there are 3 microarrays named CB26-2, CB511 and CB512). Any number of comment lines, each one starting with a #, can precede the actual array data (however, no empty lines are allowed before, after or between comment lines). All comment lines will be ignored during file parsing. The beginning of the actual data is marked by the first row that does not begin with a # character. This row is expected to have N+2 tab-separated column names. The first column name must read "AffyId"; the second column name must read "Annotation". The subsequent N entries are assumed to be the labels of the N microarrays whose data are contained in the file. Any string can be used as a microarray label as long as it does not contain tab characters.

Then a number of "Description" lines follow (zero or more of them). Description lines can be used to group microarrays into array sets (see online help section on "Markers and Phenotypes" for a description of array sets; or read the online tutorial titled "Data Subsets" in the Tutorials section of the geWorkbench web site, http://www.geworkbench.org). Each line describes one collection of array sets and comprises N+2 tab separated entries. The first entry must be the word "Description". The second entry is the label that will be used in the application to identify this collection of array sets (users can select which collection to work with from the "Array/Phenotype Sets" drop-down menu within the Arrays/Phenotypes application component). The following N columns are array set labels that map one-to-one to the N arrays listed in the "AffyID" line; their relative order in the description line determines how the arrays are grouped into sets. E.g., in the example above the first description line defines a collection of 2 array sets: the first set is called "Cord blood" and contains the arrays CB26-2 and CB511. The second set is called "pFL" and contains the array CB512. The collection of these two array sets is itself named and is called "short designation".
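
The grouping rule for a "Description" line can be sketched as follows. The function name array_sets_from_description is hypothetical, and the sample line is reconstructed from the description of the example above rather than copied from the example file.

```python
def array_sets_from_description(array_names, description_line):
    """Group arrays into sets using one tab-separated Description line."""
    fields = description_line.rstrip("\r\n").split("\t")
    if fields[0] != "Description":
        raise ValueError("not a Description line")
    collection_label, set_labels = fields[1], fields[2:]
    if len(set_labels) != len(array_names):
        raise ValueError("expected one set label per array")

    array_sets = {}
    # The i-th label names the set that the i-th array belongs to.
    for array_name, set_label in zip(array_names, set_labels):
        array_sets.setdefault(set_label, []).append(array_name)
    return collection_label, array_sets


arrays = ["CB26-2", "CB511", "CB512"]
line = "Description\tshort designation\tCord blood\tCord blood\tpFL"
collection, sets = array_sets_from_description(arrays, line)
print(collection, sets)
```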

The last part of the file contains the actual expression measurements. There is one line per marker on the chip (there are 4 such lines in the example). Each line starts with the name of the marker. The name is followed by a string that provides a human-readable annotation of the marker (this entry must contain a non-empty string; if there is no meaningful annotation to be associated with a marker then the marker name may be replicated here or a series of one or more space characters can be used). The measurement values follow the annotation string. This portion of a data line can assume one of two possible forms:

  • Measurements and p-values: there are 2N tab-separated decimal values present (the example above has this form). These values are parsed in consecutive pairs. The i-th pair provides data for the value of the marker on the i-th array and comprises (a) the expression measurement for the marker on the array and (b) a p-value (between 0 and 1) indicating the strength of the call. The p-value is used by the application to infer the detection call for the corresponding measurement. The algorithm used for this purpose is the same as the one used in the Affymetrix MAS5/GCOS Format to infer detection calls from the value of the "Detection p-value" column.
  • Measurements only: there are N tab-separated decimal values present. The i-th value provides the expression measurement for the marker on the i-th microarray. All measurements are treated as having a detection call of "Present".
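
The two forms of a data line can be told apart by counting the values after the annotation. The sketch below shows one way to do this. The function name parse_matrix_data_line is hypothetical and the sample line is invented. A p-value of None stands for the "measurements only" form, where every call is treated as "Present".

```python
def parse_matrix_data_line(line, n_arrays):
    """Split one data line into marker, annotation and (value, p-value) pairs."""
    fields = line.rstrip("\r\n").split("\t")
    marker, annotation = fields[0], fields[1]
    numbers = [float(x) for x in fields[2:]]

    if len(numbers) == 2 * n_arrays:
        # Consecutive pairs: measurement, p-value, measurement, p-value, ...
        measurements = numbers[0::2]
        p_values = numbers[1::2]
    elif len(numbers) == n_arrays:
        measurements = numbers
        p_values = [None] * n_arrays
    else:
        raise ValueError(f"expected {n_arrays} or {2 * n_arrays} values")
    return marker, annotation, list(zip(measurements, p_values))


with_p = parse_matrix_data_line("1000_at\tMAPK3\t1.5\t0.01\t2.5\t0.05\t3.5\t0.5", 3)
without_p = parse_matrix_data_line("1000_at\tMAPK3\t1.5\t2.5\t3.5", 3)
print(with_p)
print(without_p)
```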


Notes

  1. In the case where there are two data values per array, e.g. signal and p-value, the format is difficult to create by hand because the header columns do not directly label or line up with the data columns.
  2. Example file: An example of the first 14 lines of the file Bcell-100.exp (used in many examples in these tutorials) is available here.


Tab-delimited format

This is an example of a file that the application will accept as a correctly formatted "Tab-delimited" file type (only the first 20 lines of the file are shown), as output by the program RMAExpress:


Tab-delimited Format.png


Any number of comment lines, each one starting with a # or a !, can precede the actual array data (however, no empty lines are allowed before, after or between comment lines). All comment lines will be ignored during file parsing. The beginning of the actual data is marked by the first row that does not begin with a # or a ! character. This row is expected to have N+1 tab-separated column names. The first column name can be any string (in the example above it has the value "ID"), but it cannot be left blank or consist only of whitespace (for example, a header cannot start with "\t"). The remaining N entries are assumed to be the names of the microarrays comprising the microarray set (in the example above the microarray names are "alpha_factor_release sample013", "alpha_factor_release_sample014" and "alpha_factor_release_sample015").

Subsequent lines contain the actual data. Each line corresponds to a single marker and consists of N+1 tab-separated entries. The first entry is the marker name (a string). The remaining N entries are real numbers providing the expression level of the marker on each of the microarrays in the set. A missing value or a non-real value results in the measurement being marked as "missing".
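
A minimal reader for this layout might look like the sketch below. It is an illustration only, not the geWorkbench parser. The names read_tab_delimited and to_value are hypothetical, treating NaN and infinity as "non-real" is an assumption, and the sample data are invented.

```python
import io
import math


def to_value(text):
    """Return a float, or None when the entry is missing or not a real number."""
    try:
        value = float(text)
    except ValueError:
        return None
    # Assumption: NaN and infinity also count as "non-real" values.
    return value if math.isfinite(value) else None


def read_tab_delimited(handle):
    """Return (array names, {marker name: list of values or None})."""
    array_names = None
    data = {}
    for line in handle:
        line = line.rstrip("\r\n")
        if array_names is None:
            if line.startswith(("#", "!")):
                continue  # comment line before the header
            header = line.split("\t")
            if not header[0].strip():
                raise ValueError("first column name must not be blank")
            array_names = header[1:]
            continue
        if not line:
            continue
        fields = line.split("\t")
        raw = fields[1:] + [""] * (len(array_names) - len(fields) + 1)
        data[fields[0]] = [to_value(x) for x in raw[:len(array_names)]]
    return array_names, data


sample = (
    "# comment\n"
    "! another comment\n"
    "ID\tsample013\tsample014\n"
    "YAL001C\t7.25\tNA\n"
    "YAL002W\t\t6.5\n"
)
names, data = read_tab_delimited(io.StringIO(sample))
print(names, data)
```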


Notes

  1. This simple tab-delimited format just contains array and marker names and the expression data. It does not contain any annotation, nor does it support any groupings of arrays.
  2. This format only supports a single value per marker per array. A second, confidence value is not supported.
  3. Some microarray platforms can include multiple markers/probesets for some or many of the genes represented. If a data file contains e.g. gene symbols rather than individual marker names in the first column, any resulting duplicate appearance of such a label will prevent the file from being read in to geWorkbench.
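
Before loading such a file, the first column can be checked for repeated labels. The following sketch is a hypothetical pre-check and not part of geWorkbench. It expects the data lines only (header and comment lines already removed), and the sample lines are invented.

```python
from collections import Counter


def duplicate_labels(data_lines):
    """Return the first-column labels that occur more than once."""
    labels = [line.split("\t", 1)[0] for line in data_lines if line.strip()]
    counts = Counter(labels)
    return sorted(label for label, count in counts.items() if count > 1)


lines = ["EGFR\t1.0\t2.0", "TP53\t2.0\t3.0", "EGFR\t3.0\t4.0"]
print(duplicate_labels(lines))
```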

GenePix Format

Each GenePix file contains data for a differential expression experiment involving a single microarray. A detailed description of the GenePix .gpr format can be found at the manufacturer's site.

The top portion of the file (the GPR Header) contains information about image acquisition and analysis. This part is ignored during parsing. From the actual data portion of the file only the following data columns are parsed:

  • ID: The contents of this column are used as the marker names.
  • F635 Median, F635 Mean, B635 Median, B635 Mean, F532 Median, F532 Mean, B532 Median, B532 Mean: The values of these columns are used for calculating the composite expression measurement for a marker. There are 4 possible options for performing this calculation (the Tools->Preferences menu can be used to specify which of the 4 to use):
    1. (F635 Mean - B635 Mean) / (F532 Mean - B532 Mean).
    2. (F635 Median - B635 Median) / (F532 Median - B532 Median).
    3. (F532 Mean - B532 Mean) / (F635 Mean - B635 Mean).
    4. (F532 Median - B532 Median) / (F635 Median - B635 Median).
  • Flags: Parsed as an arbitrary string. The Flags value associated with a marker can be used to filter out the marker when using the "Genepix Flags" filter (for details see the online help section on Filters).
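
The four calculation options can be written as a single function, shown here as a sketch. The function name genepix_ratio and the dictionary-based row are hypothetical, the sample intensities are invented, and returning None for a zero denominator is an assumption (the text does not say how that case is handled).

```python
def genepix_ratio(row, option=1):
    """Composite measurement for one marker; option is 1 to 4 as listed above."""
    if option not in (1, 2, 3, 4):
        raise ValueError("option must be 1, 2, 3 or 4")
    stat = "Mean" if option in (1, 3) else "Median"
    # Background-subtracted intensity in each channel.
    ch635 = row[f"F635 {stat}"] - row[f"B635 {stat}"]
    ch532 = row[f"F532 {stat}"] - row[f"B532 {stat}"]
    numerator, denominator = (ch635, ch532) if option in (1, 2) else (ch532, ch635)
    if denominator == 0:
        return None  # assumption: undefined ratio reported as missing
    return numerator / denominator


row = {
    "F635 Mean": 1200.0, "B635 Mean": 200.0,
    "F532 Mean": 700.0, "B532 Mean": 200.0,
    "F635 Median": 900.0, "B635 Median": 100.0,
    "F532 Median": 300.0, "B532 Median": 100.0,
}
for option in (1, 2, 3, 4):
    print(option, genepix_ratio(row, option))
```

Options 3 and 4 are simply the reciprocals of options 1 and 2.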


Annotation Files

Currently geWorkbench only supports the Affymetrix annotation file format. This is a CSV (comma separated value) format. Rows are individual probesets, and each column contains a different annotation type.

Affymetrix annotation files for their platforms can be obtained directly from the Affymetrix support website (registration required). These files are updated quarterly.

geWorkbench expression file parser support for Affymetrix annotation files

These geWorkbench file parsers will accept an Affymetrix-format annotation file during the data loading process:

  • Affymetrix file matrix (geWorkbench native format)
  • MAS5/GCOS (Affymetrix)
  • GEO SOFT
    • GSM - sample
    • GSE - series, a series of samples representing an experiment
    • GDS - data set, a curated data matrix
  • GEO Series Matrix
  • MAGE-TAB data matrix
  • Tab-delimited (e.g. RMAExpress, etc).

The geWorkbench GenePix file parser does not support a separate annotation file.

Creating a custom annotation file

For a non-Affymetrix platform, you can create your own custom annotation file using only the annotation columns needed for the analyses you intend to perform in geWorkbench.

The geWorkbench annotation file parser recognizes the column headers shown in the next section. All columns are optional; however, if "Probe Set ID" is not present, none of the annotation records will link to the expression file.

Recognized Column Headers

  • "Probe Set ID"
  • "Species Scientific Name"
  • "Archival UniGene Cluster"
  • "UniGene ID"
  • "Genome Version"
  • "Alignments"
  • "Gene Title" (e.g. Epidermal growth factor receptor)
  • "Gene Symbol" (e.g. EGFR)
  • "Entrez Gene"
  • "SwissProt"
  • "RefSeq Transcript ID"
  • "Gene Ontology Biological Process"
  • "Gene Ontology Cellular Component"
  • "Gene Ontology Molecular Function"
  • "Pathway"
  • "Transcript Assignments"

For fields that contain multiple values, the delimiter between the values is "///".
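
As an illustration of how a small custom annotation file could be read, the sketch below uses Python's csv module and splits multi-valued fields on "///". The function names are hypothetical and the sample record is for illustration only. Keeping only the first record for a repeated probeset is just one of the choices described under Parsing Errors.

```python
import csv
import io


def split_multi_value(field):
    """Split a multi-valued annotation field on the '///' delimiter."""
    return [part.strip() for part in (field or "").split("///") if part.strip()]


def read_annotations(handle):
    """Return {probeset id: {column name: list of values}}."""
    reader = csv.DictReader(handle)
    if "Probe Set ID" not in (reader.fieldnames or []):
        raise ValueError('"Probe Set ID" column is required to link records')
    annotations = {}
    for row in reader:
        probeset = row["Probe Set ID"]
        if probeset in annotations:
            continue  # keep the first record for a repeated probeset
        annotations[probeset] = {
            column: split_multi_value(value)
            for column, value in row.items()
            if column not in (None, "Probe Set ID")
        }
    return annotations


sample = (
    '"Probe Set ID","Gene Symbol","Entrez Gene"\n'
    '"1007_s_at","DDR1 /// MIR4640","780 /// 100616237"\n'
)
annotations = read_annotations(io.StringIO(sample))
print(annotations)
```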

Parsing Errors

The geWorkbench parser checks whether a given probeset occurs more than once in the annotation file. If it does, an error dialog will appear, offering three choices:

  • Skip this duplicate annotation entry.
  • Skip all duplicate annotation entries.
  • Cancel - do not load the annotation file.

This dialog is shown in the following image:

Annotation Parser Handle Duplicates.png


Network Formats

Tools such as ARACNe, the CNKB, and Master Regulator Analysis (MRA) make use of network interaction files. ARACNe creates "adjacency matrix" (ADJ) format files, and MRA reads them. The CNKB can export complete sets of interactions in either the ADJ or SIF format.

SIF format

The Simple Interaction Format (SIF) was developed for Cytoscape.

For a full definition see the Cytoscape manual, for example the Cytoscape manual v2.8.

Each line contains interactions of a particular type for the first node with one or more target nodes:

  • node1 interaction-type-code node2 node3 node4 etc.

Some interaction-type-codes used in the CNKB are

  • pp protein-protein
  • pd protein-DNA
  • tm modulator-TF
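
A SIF line of the form shown above can be expanded into individual interactions as in this sketch. The function name parse_sif_line is hypothetical, the sample line is invented, and node names are assumed not to contain whitespace.

```python
def parse_sif_line(line):
    """Expand 'node1 type node2 node3 ...' into (source, type, target) tuples."""
    fields = line.split()  # assumption: whitespace-delimited, no spaces in names
    if len(fields) < 2:
        raise ValueError("a SIF line needs a source node and an interaction type")
    source, interaction_type, targets = fields[0], fields[1], fields[2:]
    return [(source, interaction_type, target) for target in targets]


print(parse_sif_line("TP53 pp MDM2 EP300"))
```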

Adjacency matrix (ADJ) format

This is the format used by ARACNe.

Each line lists all of the interactions in which node1 takes part:

  • node1 node2 value2 node3 value3 node4 value4 etc....

where valueN can be for example the mutual information, a confidence value etc.
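
Reading one such line amounts to pairing each partner node with the value that follows it. The sketch below assumes whitespace-separated fields and numeric values. The function name parse_adj_line and the sample line are hypothetical.

```python
def parse_adj_line(line):
    """Return (node1, {partner node: value}) for one ADJ line."""
    fields = line.split()  # assumption: fields are separated by tabs or spaces
    node, rest = fields[0], fields[1:]
    if len(rest) % 2 != 0:
        raise ValueError("each partner node must be followed by a value")
    # Entries alternate: partner name, value, partner name, value, ...
    partners = {rest[i]: float(rest[i + 1]) for i in range(0, len(rest), 2)}
    return node, partners


print(parse_adj_line("MYC\tMAX\t0.42\tNCL\t0.17"))
```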


Misc. Formats

MRA Export format

The data is saved in blocks of comma-separated values, one block per "master regulator". The first line of each block contains the marker id and the gene name of the master regulator. Each subsequent line in the block contains a marker id and gene name, followed by a zero and then the t-value for that marker. The position holding the zero previously contained a p-value, which is no longer calculated.

220462_at, CSRNP3
200660_at, S100A11, 0.0, 8.681083
201474_s_at, ITGA3, 0.0, 6.0778093
202910_s_at, CD97, 0.0, 7.482091
203370_s_at, PDLIM7, 0.0, 7.235049
205479_s_at, PLAU, 0.0, 9.105933
208540_x_at, S100A11, 0.0, 8.539139
208690_s_at, PDLIM1, 0.0, 7.2551293
210735_s_at, CA12, 0.0, 6.954225
211924_s_at, PLAUR, 0.0, 6.1465526
214853_s_at, SHC1, 0.0, 9.554302
202614_at, SLC30A9
160020_at, MMP14, 0.0, 4.3422446
200808_s_at, ZYX, 0.0, 6.7016177
200859_x_at, FLNA, 0.0, 6.5627837
201315_x_at, IFITM2, 0.0, 6.2328486
201389_at, ITGA5, 0.0, 6.451033
201883_s_at, B4GALT1, 0.0, 3.3089304
202888_s_at, ANPEP, 0.0, 4.060898
203149_at, PVRL2, 0.0, 4.0649285
203370_s_at, PDLIM7, 0.0, 7.235049
205936_s_at, HK3, 0.0, 3.5691133
207667_s_at, MAP2K3, 0.0, 5.186854
208690_s_at, PDLIM1, 0.0, 7.2551293
209359_x_at, RUNX1, 0.0, 6.7081113
211924_s_at, PLAUR, 0.0, 6.1465526
211966_at, COL4A2, 0.0, 7.394549
214752_x_at, FLNA, 0.0, 7.3759694
214866_at, PLAUR, 0.0, 6.628238
215498_s_at, MAP2K3, 0.0, 7.774499
215706_x_at, ZYX, 0.0, 6.2943435
217691_x_at, SLC16A3, 0.0, 5.4462223
221807_s_at, TRABD, 0.0, 3.2993512
35626_at, SGSH, 0.0, 6.4077697
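
The block structure can be recovered by counting the fields on each line: two fields start a new master regulator block, and four fields add a marker to the current block. The sketch below uses a few lines taken from the example above. The function name parse_mra_export and the dictionary layout are hypothetical.

```python
def parse_mra_export(lines):
    """Group exported lines into one block per master regulator."""
    blocks = []
    for line in lines:
        fields = [field.strip() for field in line.split(",")]
        if len(fields) == 2:
            # Header line of a block: marker id and gene name of the regulator.
            blocks.append({"marker": fields[0], "gene": fields[1], "targets": []})
        elif len(fields) == 4:
            if not blocks:
                raise ValueError("marker line found before any master regulator")
            marker, gene, _unused_zero, t_value = fields
            blocks[-1]["targets"].append((marker, gene, float(t_value)))
        elif line.strip():
            raise ValueError(f"unexpected line: {line!r}")
    return blocks


sample_lines = [
    "220462_at, CSRNP3",
    "200660_at, S100A11, 0.0, 8.681083",
    "201474_s_at, ITGA3, 0.0, 6.0778093",
    "202614_at, SLC30A9",
    "160020_at, MMP14, 0.0, 4.3422446",
]
blocks = parse_mra_export(sample_lines)
for block in blocks:
    print(block["gene"], len(block["targets"]))
```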

Pattern Discovery Export Format

Each pattern line in the exported file contains four tab-separated fields:

  1. sequence number
  2. pattern
  3. [parameters]
  4. [line, hit position] (both numbers are zero-based)

discovery
File:C:\Users\ksmith\Desktop\geWB_data\fasta\H1H5_HistoneDB_NHGRI.fasta
[0]	LKER.GSS	[4,4,7,209.92124911313488]	[0,57][1,67][2,68][4,59]
[1]	L.QTKG.GASGSFKLS	[3,3,14,604829.8848207161]	[0,100][3,94][4,102]
[2]	A.AKKP.AK	[4,4,7,172.99810658850754]	[0,239][1,225][2,225][4,245]
[3]	AT.KKP.AK	[4,4,7,172.99810658850754]	[0,239][1,132][2,133][4,245]
[4]	KKP.AKKA	[3,3,7,11.941700055286251]	[1,228][2,228][4,165]
[5]	AA.KKA.AAA	[3,3,8,54.52686758911383]	[1,193][2,194][3,61]
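
A pattern line such as those above can be split into its four fields as in the following sketch. The function name parse_pattern_line is hypothetical. The parameter values are returned as plain numbers without interpretation, since their meaning is not described here.

```python
import re


def parse_pattern_line(line):
    """Split one exported pattern line into its four tab-separated fields."""
    number, pattern, params, hits = line.rstrip("\r\n").split("\t")
    sequence_number = int(number.strip("[]"))
    parameters = [float(x) for x in params.strip("[]").split(",")]
    # Each hit is a zero-based [line, position] pair.
    hit_positions = [(int(a), int(b)) for a, b in re.findall(r"\[(\d+),(\d+)\]", hits)]
    return sequence_number, pattern, parameters, hit_positions


example = "[0]\tLKER.GSS\t[4,4,7,209.92124911313488]\t[0,57][1,67][2,68][4,59]"
print(parse_pattern_line(example))
```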