Gene expression time course data is provided for four different strains of yeast (S. cerevisiae), after perturbation of the cells. The challenge is to predict the rank order of induction/repression of a small subset of genes (the “prediction targets”) in one of the four strains, given complete data for three of the strains, and data for all genes except the prediction targets in the other strain. Predictors are also allowed to use any information that is in the public domain but are expected to be forthcoming about what information was used.


Background

GAT1, GCN4, and LEU3 are yeast transcription factors. Each of them plays a role in controlling genes involved in nitrogen or amino acid metabolism. None of the three genes is essential: strains that have perfect deletions of any of these genes are viable. In this challenge, we provide gene expression data from four strains: (i) a strain that is wild-type for all three transcription factors (wt, or parental), (ii) a strain that is identical to the parental strain except that it has a deletion of the GAT1 gene (gat1Δ), (iii) a strain that is identical to the parental strain except that it has a deletion of the GCN4 gene (gcn4Δ), and (iv) a strain that is identical to the parental strain except that it has a deletion of the LEU3 gene (leu3Δ).

Expression levels were assayed separately in all four strains following the addition of 3-aminotriazole (3AT). 3AT is an inhibitor of an enzyme in the histidine biosynthesis pathway. In the appropriate media (which is the case in these experiments), inhibition of the histidine biosynthetic pathway has the effect of starving the cells for this essential amino acid.

Data from eight time points was obtained from 0 to 120 minutes. Time t=0 means the absence of 3AT.


The Challenge

Predict, for a set of 50 genes, the expression levels in the gat1Δ strain in the absence of 3-aminotriazole (t=0) and at 7 time points (t=10, 20, 30, 45, 60, 90 and 120 minutes) following the addition of 3AT. Absolute expression levels are not required or desired; instead, the fifty genes should be ranked by their induction or repression relative to the expression levels observed in the wild-type parental strain in the absence of 3AT.


The Datasets

The files provided for this challenge are detailed below.

The file DREAM3_GeneExpressionChallenge_TargetList.txt is a tab-delimited file that lists the target genes whose relative induction/repression is to be predicted. The first column lists the Affymetrix probeset IDs. The second column lists the corresponding commonly used gene names, as extracted from files obtained from Affymetrix. This file should also be used as a template for submission of predictions. Consequently, there are headings for eight additional columns (see the section Submission Information).

The file DREAM3_GeneExpressionChallenge_ExpressionData.txt is a tab-delimited file that provides the relevant expression data. Columns are labeled, and are summarized here as well. The first column gives the Affymetrix probeset ID. The second column lists the commonly used gene name if there is one for that probeset. The third column gives the absolute expression level (in arbitrary units) for the probeset in the parental strain at time t=0. The next set of 8 columns contains the time course data for the wild-type strain, the following set of 8 columns contains the time course data for the gat1Δ strain, the next set of 8 columns contains the time course data for the gcn4Δ strain, and the final set of 8 columns contains the time course data for the leu3Δ strain. Within each set of columns, the time points are t=0, 10, 20, 30, 45, 60, 90 and 120 minutes. The values in all of these time course columns express transcript levels as the log (base 2) of the ratio of expression in the indicated strain and time point to the expression level in the parental strain at time t=0. Thus, positive values indicate higher levels of expression than are observed for that probeset in the parental strain at time t=0, and negative values indicate lower expression. Data is provided for all probesets, in all strains, and at all time points, except for the 50 probesets (genes) whose expression is to be predicted (those listed in DREAM3_GeneExpressionChallenge_TargetList.txt). For those genes, the text “PREDICT” was inserted in the corresponding entries in the columns that correspond to the gat1Δ data in the file DREAM3_GeneExpressionChallenge_ExpressionData.txt.
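As an illustration of the column layout described above, the following is a minimal Python sketch of how the expression file might be read. It assumes a single header row, relies on column positions rather than header labels, and stores “PREDICT” entries as None. The function names, the ASCII strain keys and the sample row are made up for illustration and are not part of the challenge materials.

```python
import csv
import io

# Order of the four 8-column blocks, as described in the text above.
# ASCII keys are used here in place of gat1Δ, gcn4Δ and leu3Δ.
STRAINS = ["wt", "gat1", "gcn4", "leu3"]
TIMES = [0, 10, 20, 30, 45, 60, 90, 120]  # minutes


def parse_expression(handle):
    """Return {probeset_id: {"gene", "abs_wt_t0", and one list of 8 values per strain}}.

    Entries marked "PREDICT" are stored as None.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # assumption: exactly one header row
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        record = {"gene": row[1], "abs_wt_t0": float(row[2])}
        for i, strain in enumerate(STRAINS):
            start = 3 + len(TIMES) * i
            block = row[start:start + len(TIMES)]
            record[strain] = [None if v == "PREDICT" else float(v) for v in block]
        table[row[0]] = record
    return table


def log2_ratio_to_fold(value):
    """Convert a log2 ratio back to a fold-change relative to wild-type at t=0."""
    return 2.0 ** value


# Made-up sample row (not real data) showing the expected shape of one line.
_row = ["probe_1", "GENE1", "100.0"] + ["0.0"] * 8 + ["PREDICT"] * 8 + ["1.0"] * 8 + ["-1.0"] * 8
_sample = "header line\n" + "\t".join(_row) + "\n"
demo = parse_expression(io.StringIO(_sample))
```

In this sketch, a log2 ratio of 1.0 corresponds to a two-fold induction and a value of -1.0 to a two-fold repression.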

PLEASE NOTE. The data that is being provided initially is derived from two technical replicates, using a single biological replicate. An additional biological replicate will be obtained soon, and a new version of the DREAM3_GeneExpressionChallenge_ExpressionData.txt file will be provided.


UPDATE NOTE (July 15, 2008)

As noted in the original posting of this challenge, the data set that was provided initially, DREAM3_GeneExpressionChallenge_ExpressionData.txt, was based on a single biological replicate, with two technical replicates. We noted that the data file was going to be updated as additional data were obtained. Challenge participants are hereby notified that the original data file has now been superseded by the file

DREAM3_GeneExpressionChallenge_ExpressionData_UPDATED.txt.

The values in this file are based on the original data, plus a new biological replicate. All array data have been reprocessed using the RMA algorithm within the commercial program GeneSpring. Probeset hybridization values were median normalized within arrays prior to the calculation of fold-change. This is the dataset that will be used in the evaluation of challenge predictions.
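To make the last two processing steps concrete, here is a rough Python sketch of within-array median normalization followed by a log2 fold-change calculation. It is not the GeneSpring or RMA implementation; it assumes linear-scale intensities and simple division by the array median, and the function names are hypothetical.

```python
import math
import statistics


def median_normalize(intensities):
    """Divide every value on one array by that array's median (assumed linear scale)."""
    med = statistics.median(intensities)
    return [x / med for x in intensities]


def log2_fold_change(sample, reference):
    """log2 of the ratio of a normalized value to its reference (wild-type, t=0)."""
    return math.log2(sample / reference)
```

For example, under these assumptions a probeset with a normalized value of 4.0 in some strain and time point, and 1.0 in the parental strain at t=0, would be reported as a log2 ratio of 2.0.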

Submission Information

Predictors should make a copy of the file DREAM3_GeneExpressionChallenge_TargetList.txt, and rename it

TeamName_ExpressionChallenge.txt,

where TeamName is the name of the team with which you registered for the challenge. Next to the first two columns, which list the probeIDs and gene names of the prediction targets, are eight tab-separated columns labeled “rank time0”, “rank time10” and so on. The genes should be ranked according to predicted fold-induction relative to the expression level for that gene in the wild-type strain at time 0. The gene predicted to have the highest fold-induction should be given the value “1”, and the gene with the greatest fold-repression should be given the value “50”. All other genes should be given rank values in between.
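The ranking rule above can be sketched in Python as follows. This is only an illustration: it assumes the predictions are held as predicted log2 ratios (one list of eight values per target gene), it breaks ties by input order, and the labels used for the first two header columns are placeholders, since in practice the header comes from the template file. The function names are hypothetical.

```python
TIME_POINTS = [0, 10, 20, 30, 45, 60, 90, 120]  # minutes


def rank_descending(scores):
    """Rank 1 goes to the largest score (highest induction), rank n to the smallest.

    Ties are broken by input order, which is an arbitrary choice made here.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def build_submission_lines(probe_ids, gene_names, predictions):
    """Build tab-separated lines; predictions[i] holds 8 predicted log2 ratios for gene i."""
    # Placeholder labels for the first two columns; the template file defines the real ones.
    header = ["ProbeID", "GeneName"] + [f"rank time{t}" for t in TIME_POINTS]
    # Rank the genes separately at each time point.
    columns = [
        rank_descending([gene_values[t] for gene_values in predictions])
        for t in range(len(TIME_POINTS))
    ]
    lines = ["\t".join(header)]
    for i, (probe_id, gene) in enumerate(zip(probe_ids, gene_names)):
        ranks = [str(column[i]) for column in columns]
        lines.append("\t".join([probe_id, gene] + ranks))
    return lines
```

With the 50 target genes as input, each of the eight rank columns produced this way contains every value from 1 to 50 exactly once.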

Scoring Metrics

Predictions will be assessed based on rank order metrics such as Spearman’s rank correlation coefficient, and its corresponding p-value under the null hypothesis that the ranks are randomly distributed.
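For readers who want to check their own rankings against held-out data, the following Python sketch computes Spearman's rank correlation for two tie-free rankings, together with a one-sided Monte Carlo permutation p-value. The exact procedure used by the organizers is not specified here, so the permutation approach, the number of trials and the function names are all assumptions.

```python
import random


def spearman_rho(pred_ranks, true_ranks):
    """Spearman's rho for two rankings without ties: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(pred_ranks)
    d2 = sum((p - t) ** 2 for p, t in zip(pred_ranks, true_ranks))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def permutation_p_value(pred_ranks, true_ranks, trials=10000, seed=0):
    """Estimate P(rho >= observed) when the predicted ranks are randomly ordered."""
    rng = random.Random(seed)
    observed = spearman_rho(pred_ranks, true_ranks)
    shuffled = list(pred_ranks)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if spearman_rho(shuffled, true_ranks) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)
```

The closed-form expression for rho used here is valid only when neither ranking contains ties, which holds for rankings that use each value from 1 to 50 exactly once.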


Data Download

Retrieved from "http://wiki.c2b2.columbia.edu/dream/index.php/The_Gene-Expression_Prediction_Challenge._Description"

This page was last modified 04:45, 9 September 2008.
