BCL6 Transcriptional Target Prediction (DREAM2, Challenge 1)
This archival page describes the challenge exactly as it was presented to the participants. Go to the main DREAM2 Challenge 1 page to download data, view team rankings, cite this work, etc.
Synopsis
A number of potential transcriptional targets of BCL6, a gene that encodes a transcription factor active in B cells, have been identified with ChIP-on-chip data and functionally validated by perturbing the BCL6 pathway with CD40 and anti-IgM, and by over-expressing exogenous BCL6 in Ramos cells. We selected a subset of the targets found in this way (the "gold standard positive" set) and added a number of decoys (genes that have no evidence of being BCL6 targets, named the "gold standard negative" set), compiling a list of 200 genes in total. Given this list of 200 genes, the challenge consists of identifying which ones are the true targets and which ones are the decoys, using an independent panel of gene expression data.
Dataset
File with Gene IDs: The file BCL6_targets_and_decoys.xls contains the Entrez GeneIDs (first column) along with the Affymetrix HGU95Av2 GeneChip probe sets corresponding to each gene (second column). When more than one probe set is associated with the same Entrez GeneID, the probe sets are separated by two slashes: //. Some of these genes are true transcriptional targets of BCL6. For the remaining genes, there is no evidence that they are BCL6 targets. To determine which of these 200 genes are BCL6 targets and which are not, you can use sequence information, gene ontology annotations, or any other tool you consider appropriate. You can also use the microarray data described below.
Microarray Data: A panel of 336 Affymetrix HGU95Av2 GeneChip arrays probing B cells under different conditions can be accessed from the Gene Expression Omnibus database at http://www.ncbi.nlm.nih.gov/geo/ by querying for GEO accession GSE2350. Files with MAS5 normalized data in matrix format can be downloaded by clicking on the “Series Matrix File(s)” link near the bottom of the page. Note that there are two files (GSE2350_series_matrix-1.txt.gz with 255 chips and GSE2350_series_matrix-2.txt.gz with 81 chips) that must be joined together to use the entire dataset. In each file, the data matrix begins after a series of header lines which begin with the character ‘!’. In the first line of the data matrix, the entries after “ID_REF” are column headers listing the name of each sample. All succeeding lines (until the last one) contain a probe set ID followed by a series of numbers giving the MAS5 normalized intensity values for this probe set in the samples listed in the column headers. The file is terminated by the line “!series_matrix_table_end”. Alternatively, raw data in .cel format can be downloaded by clicking on the GSE2350_RAW.tar link. These data can then be normalized using the method of your choice.
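As an illustration of the file layout described above, the following Python sketch reads a series matrix file and joins the two parts by probe set ID. It is a minimal sketch, not an official loader: the function names (read_series_matrix, join_matrices, load_gz) are hypothetical, and it assumes tab-separated fields, optional double quotes around IDs and sample names, and that an empty field denotes a missing value.

```python
import gzip


def read_series_matrix(lines):
    """Parse one series matrix from an iterable of text lines.

    Returns (sample_names, {probe_set_id: [intensity, ...]}).
    """
    samples = None
    table = {}
    for line in lines:
        line = line.rstrip("\r\n")
        # Header lines and the terminator line all begin with '!'.
        if not line or line.startswith("!"):
            continue
        # Assumption: fields are tab-separated and may be double-quoted.
        fields = [f.strip('"') for f in line.split("\t")]
        if samples is None:
            # First data-matrix line: "ID_REF" followed by the sample names.
            samples = fields[1:]
            continue
        # Assumption: an empty field means a missing intensity.
        table[fields[0]] = [float(v) if v else float("nan") for v in fields[1:]]
    return samples or [], table


def join_matrices(first, second):
    """Concatenate the columns of two parsed matrices, keeping shared probe sets."""
    samples_a, table_a = first
    samples_b, table_b = second
    joined = {p: table_a[p] + table_b[p] for p in table_a if p in table_b}
    return samples_a + samples_b, joined


def load_gz(path):
    """Read a gzip-compressed series matrix file from disk."""
    with gzip.open(path, "rt") as handle:
        return read_series_matrix(handle)
```

With both files downloaded, join_matrices(load_gz("GSE2350_series_matrix-1.txt.gz"), load_gz("GSE2350_series_matrix-2.txt.gz")) would be expected to yield one table covering all 336 chips.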
Useful Information: In the HGU95Av2 GeneChip, BCL6 (Entrez GeneID 604) is represented by three probe sets: 40091_at//978_at//979_g_at.
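The double-slash convention used for probe sets can be handled with a simple split. The Python sketch below is illustrative only; the helper names (split_probe_sets, gene_to_probes) are hypothetical, and it assumes the two columns of the gene ID file have already been exported to (GeneID, probe set field) pairs of text.

```python
def split_probe_sets(field):
    """Split a '//'-separated probe set field into a list of probe set IDs."""
    return [p.strip() for p in field.split("//") if p.strip()]


def gene_to_probes(rows):
    """Map each Entrez GeneID (as int) to its list of probe set IDs.

    rows: iterable of (gene_id_text, probe_set_field) pairs.
    """
    return {int(gene_id): split_probe_sets(field) for gene_id, field in rows}


# BCL6 itself (Entrez GeneID 604), as listed in the challenge description.
BCL6_PROBE_SETS = split_probe_sets("40091_at//978_at//979_g_at")
```

How to combine several probe sets for one gene (for example averaging them or picking one) is left open by the challenge description and is a modeling choice for each team.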
Submission Information
Submit a ranked list of genes, ordered according to the confidence you assign to your prediction that a gene is a true BCL6 transcriptional target, from the most reliable (first row) to the least reliable (last row) prediction. Use a tab-separated 2 column format as in the example below:
- nnn \tab XYZ
where nnn is one of the Entrez GeneID identifiers in the file BCL6_targets_and_decoys.xls, and XYZ is a score between 0 and 1 that indicates the confidence level you assign to the prediction that a gene is a true BCL6 transcriptional target. (E.g., XYZ = 1 if the gene is deemed to be a target with highest confidence and XYZ = 0 if a gene is deemed not to be a target.) All genes that belong to the gold standard positive and gold standard negative sets but are omitted from the list will be considered to appear randomly ordered at the end of the list with XYZ = 0. Save your prediction file as unformatted text, and name it:
- TeamName_BCL6targets.txt
where TeamName is the name of the team with which you registered for the challenge.
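A prediction file in this format could be produced along the following lines. This is a sketch under assumptions: the function names are hypothetical, the six-decimal score formatting is an arbitrary choice, and any GeneIDs passed in are expected to come from BCL6_targets_and_decoys.xls.

```python
def format_submission(scores):
    """Return the text of a prediction file.

    scores: dict mapping Entrez GeneID -> confidence score in [0, 1].
    Rows are ordered from the highest to the lowest score; sorted() is
    stable, so tied genes keep the order in which they were supplied.
    """
    for gene_id, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for gene {gene_id} is outside [0, 1]: {score}")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # One gene per row: GeneID, a tab character, then the score.
    return "".join(f"{gene_id}\t{score:.6f}\n" for gene_id, score in ranked)


def submission_filename(team_name):
    """Build the required file name from the registered team name."""
    return f"{team_name}_BCL6targets.txt"


def write_submission(scores, team_name):
    """Write the prediction file as plain text and return its name."""
    name = submission_filename(team_name)
    with open(name, "w", encoding="ascii", newline="\n") as handle:
        handle.write(format_submission(scores))
    return name
```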
Scoring Metrics
We will score the results using the area under the precision versus recall curve for the whole set of predictions. For the first k predictions (ranked by score, and for predictions with the same score, taken in the order they were submitted in the prediction file), precision is defined as the number of correct gold standard positive predictions divided by k, and recall is the number of correct gold standard positive predictions divided by the total number of gold standard positive targets. Other metrics such as precision at 1%, 10%, 50%, and 80% recall, and the area under the ROC curve will also be evaluated.
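The precision and recall definitions above can be sketched in Python as follows. This is an illustration, not the organizers' scoring code: the function names are hypothetical, the step-wise summation used for the area is an assumption (the description does not state the integration rule), and the random placement of omitted genes at the end of the list is not handled.

```python
def rank_predictions(rows):
    """Order (gene_id, score) rows by decreasing score.

    sorted() is stable, so rows with equal scores keep their file order.
    """
    return [gene_id for gene_id, _ in sorted(rows, key=lambda r: r[1], reverse=True)]


def precision_recall_points(ranked_ids, positives):
    """Return a (recall, precision) pair for each cutoff k = 1..len(ranked_ids)."""
    if not positives:
        raise ValueError("the gold standard positive set must not be empty")
    points = []
    hits = 0
    for k, gene_id in enumerate(ranked_ids, start=1):
        if gene_id in positives:
            hits += 1
        # recall = hits / all positives, precision = hits / k
        points.append((hits / len(positives), hits / k))
    return points


def area_under_pr(points):
    """Step-wise area under the precision versus recall curve (an assumption)."""
    area = 0.0
    previous_recall = 0.0
    for recall, precision in points:
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area
```

For a ranked list of four genes in which the first and third are gold standard positives, the points are (0.5, 1), (0.5, 0.5), (1, 2/3) and (1, 0.5), which gives an area of 5/6 under this step-wise rule.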
