Transcription-Factor/DNA-Motif Recognition Challenge
DREAM5, Challenge 2
Transcription factors (TFs) control the expression of genes through sequence-specific interactions with genomic DNA. Different TFs bind preferentially to different sequences, with the majority recognizing short (6-12 base), degenerate ‘motifs’. Modeling the sequence specificities of TFs is a central problem in understanding the function and evolution of the genome, because many types of genomic analyses involve scanning for potential TF binding sites. Models of TF binding specificity are also important for understanding the function and evolution of the TFs themselves.
Ideally, models of TF sequence binding specificity should predict the relative affinity (e.g. dissociation constant) to different individual sequences, and/or the probability of occupancy at any position in the genome. Currently, the major paradigm in modeling TF sequence specificity is the Position Weight Matrix (PWM) model. However, it is increasingly recognized that shortcomings of PWMs, such as their inability to model gaps, to capture dependencies between the residues in the binding site, or to account for the fact that TFs can have more than one DNA-binding interface, can make them inaccurate (Benos et al. 2002; Badis et al. 2009; Maerkl and Quake 2007). Alternative models that address some of the shortcomings of PWMs have been developed (Sharon et al. 2008; He et al. 2009; Zhao et al. 2005), but their relative efficacies have not been directly compared.
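To make the PWM paradigm concrete, the following is a minimal sketch of how a log-odds PWM scores a candidate site. Each position contributes an independent weight, so the score of a site is a plain sum; this is why a PWM cannot represent dependencies between positions. The aligned sites, the pseudocount, and the uniform background are hypothetical choices made for illustration, not challenge data.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per position) from equal-length aligned sites."""
    length = len(sites[0])
    pwm = []
    for pos in range(length):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        # log2 ratio of the observed base frequency to the (assumed uniform) background
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm

def score_site(pwm, site):
    """Sum the per-position weights: every position contributes independently."""
    return sum(column[base] for column, base in zip(pwm, site))

def best_window_score(pwm, sequence):
    """Score every window of the PWM's width in a longer sequence and keep the best."""
    width = len(pwm)
    return max(score_site(pwm, sequence[i:i + width])
               for i in range(len(sequence) - width + 1))

# Hypothetical aligned binding sites, used only to illustrate the calculation.
EXAMPLE_SITES = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
```

A probe-level prediction could be derived from such a model by taking, for example, the best window score over the probe; how to map that score to an intensity is left open here.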
A major difficulty in studying TF DNA-binding specificity has been scarcity of data. The process of training and testing models benefits from a large number of unbiased data points. In the case of TF binding models, the required data is the relative preference of a TF to a large number of individual sequences. Recently, Protein Binding Microarrays (PBMs) have been developed for the purpose of determining TF sequence preferences (Berger et al. 2006). The resulting data provide a quantitative score representing the relative binding affinity of a given TF to the sequence of each probe contained on the array. PBM data have produced some of the strongest evidence for the inadequacy of PWM models (Badis et al. 2009), and also provide extensive training/test data.
Given the output of probe intensities of one PBM array type, this challenge consists of predicting the probe intensities of a second array type. Each array consists of ~41,000 60-base probe sequences (each containing 35 unique bases); the two array types have completely different probe sequences. Contestants may base their predictions on any type of model (e.g. position weight matrices), but the type of model and its details must be specified in order to correctly categorize the method for evaluation purposes.
Modeling transcription factor DNA-binding activities is an active field, and there are many open questions. PBMs provide an opportunity to evaluate motif models and other representations. Each PBM is designed using de Bruijn sequences, such that all possible 10-mers and 32 copies of every non-palindromic 8-mer are contained on each array, offering an unbiased survey of TF binding preferences. The two array designs (“ME” and “HK”) were constructed using different de Bruijn sequences, so the two arrays have completely different probe sequences.
Current approaches to modeling PBM data first break the set of 35-mer probe signals into multiple measurements for each 8-mer, and summarize these measurements using either Z-scores (significance estimates based on the normal distribution of intensities) or E-scores (rank-based, non-parametric statistics created using a method similar to the Wilcoxon rank-sum test). The resulting data provides a “lookup table” summarizing the relative affinity of a given TF to each possible 8-base sequence, and is often converted to a position weight matrix by (for example) aligning all 8-mer sequences with significant scores. More advanced methods have been proposed for modeling PBM data, such as Seed and Wobble (Berger et al. 2006) and RankMotif++ (Chen et al. 2007). However, it is still an open question how to best use PBM data to model the binding preferences of a given TF.
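The 8-mer summarization step described above can be sketched as follows. This is a simplified illustration, not the published Z-score or E-score procedure: it groups probe intensities by 8-mer (treating an 8-mer and its reverse complement as the same word, which is an assumption), takes the median per 8-mer, and standardizes those medians. The function names are hypothetical, and the caller decides which part of each probe sequence to pass in.

```python
from collections import defaultdict
from statistics import mean, median, pstdev

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def canonical(kmer):
    """Represent a k-mer and its reverse complement by one key (assumption)."""
    return min(kmer, reverse_complement(kmer))

def kmer_median_intensities(probes, k=8):
    """probes: iterable of (sequence, intensity) pairs.

    Returns {canonical k-mer: median intensity of the probes containing it}.
    """
    by_kmer = defaultdict(list)
    for seq, intensity in probes:
        # Count each probe once per k-mer, even if the k-mer occurs twice in it.
        present = {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}
        for kmer in present:
            by_kmer[kmer].append(intensity)
    return {kmer: median(values) for kmer, values in by_kmer.items()}

def z_scores(value_by_kmer):
    """Standardize the per-k-mer medians (plain mean and standard deviation)."""
    values = list(value_by_kmer.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {kmer: 0.0 for kmer in value_by_kmer}
    return {kmer: (v - mu) / sigma for kmer, v in value_by_kmer.items()}
```

The resulting dictionary plays the role of the "lookup table" mentioned above; a rank-based statistic could be substituted for `z_scores` without changing the grouping step.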
The dataset for this challenge describes the binding preferences of 86 mouse TFs (representing a wide range of TF families) in the form of probe intensity signals. For 20 TFs, data (the training set) is provided from both HK and ME array types, for “practice” and method calibration. The challenge consists of predicting the signal intensities for the remaining 66 TFs. For 33 TFs, data will be provided from array type “ME”; data for the other 33 TFs will be provided for array type “HK”. Released data is to be used for model learning, and unreleased data will be used for evaluation purposes.
Three data sets will be provided in tabular form with tab-separated columns. In all cases, data columns correspond to output from the software package GenePix Pro version 6.0: “signal” corresponds to feature pixels, “background” corresponds to background pixels, and "Flag" is a binary probe-quality field. For the "Flag" field, 0 means the probe quality is good, whereas 1 means the probe was flagged as bad due to dust specks, scratches, or other imperfections. The data files are:
- DREAM5_PBM_Data_TrainingSet.txt contains Protein Binding Microarray data for 20 TFs, including both “ME” and “HK” array types. The first column, headed ID, lists the name of the TF, and the second column lists the array type. The third column lists the probe sequence, and subsequent columns list the signal and background data, followed by the flag. This is represented in the following table:
ID     Array Type  Sequence                                                      Signal Mean  Background Mean  Signal Median  Background Median  Signal Std  Background Std  Flag
Egr2   ME          CATGTAAGAAGTTATCCTGGCTGTCTAATGCCGCTCCTGTGTGAAATTGTTATCCGCTCT  15926        1030.75          18592.5        273.00             5495        1635.32         0
Egr2   ME          TTGCTCATCAGATCGCGCTAACAGGCTTTCACTTACCTGTGTGAAATTGTTATCCGCTCT  17487        760.59           20249.0        265.50             7285        1077.09         0
...
Egr2   HK          GCCAGTTTAGGTGGCGCCCGGAACCCTTAACCCATCCTGTGTGAAATTGTTATCCGCTCT  2972.4       574.22           2928.00        391.50             799.5       515.27          1
Egr2   HK          CATGTAGAGCCCTAAAACTGGGACTAAGCCGACCTCCTGTGTGAAATTGTTATCCGCTCT  3552.3       608.46           3697.00        352.00             869.8       642.76          0
...
Foxp2  ME          CATGTAAGAAGTTATCCTGGCTGTCTAATGCCGCTCCTGTGTGAAATTGTTATCCGCTCT  27336        3457.05          27283.0        1004.00            5440        5076.35         0
Foxp2  ME          TTGCTCATCAGATCGCGCTAACAGGCTTTCACTTACCTGTGTGAAATTGTTATCCGCTCT  54822        6635.14          56181.0        1306.00            8275        10833.50        0
...
Foxp2  HK          GCCAGTTTAGGTGGCGCCCGGAACCCTTAACCCATCCTGTGTGAAATTGTTATCCGCTCT  36935        8580.54          37738.0        3724.00            2782        10308.83        0
Foxp2  HK          CATGTAGAGCCCTAAAACTGGGACTAAGCCGACCTCCTGTGTGAAATTGTTATCCGCTCT  33758        6616.34          34661.0        2466.50            4111        8178.01         0
...
- DREAM5_PBM_Data_Needed_For_Predictions.txt contains Protein Binding Microarray data for 66 TFs. The TFs are indicated as TF_1, TF_2, ... TF_66 in the first column. For TF_1 to TF_33, only data from array type HK is given, as indicated in the second column of the data file. Similarly, for TF_34 to TF_66, only data from array type ME is provided. Subsequent columns list the probe sequence, the signal and background data of the corresponding PBMs, and the flag. This is represented in the following table:
ID     Array Type  Sequence                                                      Signal Mean  Background Mean  Signal Median  Background Median  Signal Std  Background Std  Flag
TF_1   HK          CTCTGTAAGTCAGGGTGACTCGAGCGGATCACCTGCCTGTGTGAAATTGTTATCCGCTCT  4572.9       707.33           4674.00        406.00             987.4       715.31          0
TF_1   HK          AGGTGGGTCCAATTATCCGATCTCACGTCGACCTTCCTGTGTGAAATTGTTATCCGCTCT  4989.0       802.11           5126.00        473.00             810.6       800.87          0
...
TF_33  HK          GCCAGTTTAGGTGGCGCCCGGAACCCTTAACCCATCCTGTGTGAAATTGTTATCCGCTCT  913.42       599.35           830.00         383.00             378.0       538.39          0
TF_33  HK          CATGTAGAGCCCTAAAACTGGGACTAAGCCGACCTCCTGTGTGAAATTGTTATCCGCTCT  849.74       336.91           790.00         193.00             305.7       397.59          1
TF_34  ME          CATGTAAGAAGTTATCCTGGCTGTCTAATGCCGCTCCTGTGTGAAATTGTTATCCGCTCT  12895        2739.14          13197.0        1159.00            1696        2893.35         0
TF_34  ME          TTGCTCATCAGATCGCGCTAACAGGCTTTCACTTACCTGTGTGAAATTGTTATCCGCTCT  16045        2054.40          16319.0        709.00             3558        2556.29         0
...
TF_66  ME          CATGTAAGAAGTTATCCTGGCTGTCTAATGCCGCTCCTGTGTGAAATTGTTATCCGCTCT  3674.6       347.45           3974.50        147.00             1008        460.29          0
TF_66  ME          TTGCTCATCAGATCGCGCTAACAGGCTTTCACTTACCTGTGTGAAATTGTTATCCGCTCT  3793.1       390.47           4073.00        150.00             1133        557.11          0
Note about bad microarray spots
Microarray flags (i.e. bad spots caused by dust, scratches, and the like) were omitted from the originally posted data files (DREAM5_PBM_Data_TrainingSet.txt and DREAM5_PBM_Data_Needed_For_Predictions.txt). If these files were downloaded prior to June 8, 2010, they did not contain a "Flag" column. Data files with the flag column were posted on June 8, 2010. Typically much less than 1% of spots are flagged, but since these spots are suspect and may have aberrantly high or low intensity, they should be masked in training data. Flagged spots will also not be considered in the evaluations.
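As an illustration of masking flagged spots, the sketch below reads a tab-separated PBM file and drops rows whose flag is 1. It assumes a single header line and the column order described above (ID, array type, sequence, signal mean, ..., flag last); the header names in the sample and the function name are hypothetical, and the two sample rows are taken from the training-set excerpt.

```python
import csv
import io

def read_pbm_rows(handle, skip_flagged=True):
    """Yield (tf_id, array_type, sequence, signal_mean) from a tab-separated PBM file.

    Assumes one header line, signal mean in the fourth column, and the flag
    in the last column (0 = good spot, 1 = flagged as bad).
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line (assumed to be a single line)
    for row in reader:
        if not row:
            continue
        if skip_flagged and int(row[-1]) == 1:
            continue  # mask suspect spots
        yield row[0], row[1], row[2], float(row[3])

# Two rows from the training-set excerpt; the header names are placeholders.
SAMPLE = "\n".join([
    "\t".join(["ID", "Array", "Sequence", "SigMean", "BgMean",
               "SigMedian", "BgMedian", "SigStd", "BgStd", "Flag"]),
    "\t".join(["Egr2", "HK",
               "GCCAGTTTAGGTGGCGCCCGGAACCCTTAACCCATCCTGTGTGAAATTGTTATCCGCTCT",
               "2972.4", "574.22", "2928.00", "391.50", "799.5", "515.27", "1"]),
    "\t".join(["Egr2", "HK",
               "CATGTAGAGCCCTAAAACTGGGACTAAGCCGACCTCCTGTGTGAAATTGTTATCCGCTCT",
               "3552.3", "608.46", "3697.00", "352.00", "869.8", "642.76", "0"]),
])

good_rows = list(read_pbm_rows(io.StringIO(SAMPLE)))
```

With the sample above, only the second probe survives the mask, because the first carries a flag of 1.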
Important information regarding measurements and PBM array types
- (a) The methodology for generating these data follows Berger 2006 and 2009 (Berger et al. 2006; Berger et al. 2009) with modifications. The arrays (ME and HK) were designed by Julian Mintseris and Mike Eisen (ME array, following Mintseris and Eisen 2006), and Hilal Kazan (HK array, following methodology described by Philippakis et al. 2008).
- (b) The data is reported in arbitrary (fluorescence) units, and represents the abundance of the transcription factor in each PBM spot.
- (c) Normalization: The data is not normalized with respect to how much DNA there is on each spot on the slide. In some instances the amount of DNA in each spot was measured by a Cy3 channel, which detects the amount of Cy3-labeled dUTP added in the double-stranding step of array generation. There is no clear evidence that the information of the Cy3 channel is useful in the data processing, and therefore we decided not to add this information to the main challenge data. However, some of the Cy3 fluorescence data is provided as Supplementary Data (see item (d) below).
- (d) For about 30% of the measurements of this challenge, information is available pertaining to the Cy3 data. Participants who want to use the Cy3 data, as well as the microarray spot maps, can find this additional information in the file DREAM5_PBM_SupplementaryData.zip (which also contains a README.doc file describing its contents) on the DREAM5 download site. Important: You may choose to ignore the supplementary information in the formulation of your predictions. The supplemental data is provided for the sake of a comprehensive data archive for the challenge, and for those whose algorithms may make use of it. Note: Among the supplementary files are the microarray grid layout maps (files in DREAM5_PBM_Data_GridFiles.zip). Minor changes were incorporated into the grid layout maps after June 8, 2010. [Either one or two lines in each file (out of ~40,000) were slightly changed with respect to the data posted prior to June 8, so that the probe sequences match up better with the sequences in the other files.] Files downloaded prior to June 8, 2010 do not contain these minor updates to the layout maps.
Participants are required to submit 2 files:
(1) For each transcription factor TF_1 to TF_33, please submit your predictions of “Signal Mean” for the probe sequences of array type ME. Likewise, for each transcription factor TF_34 to TF_66 please submit your predictions of “Signal Mean” for the probe sequences in the array type HK. Submit your predictions using the template file DREAM5_PBM_TeamName_Predictions.txt. The contents of this file are as follows:
ID     Array Type  Sequence                                                      Signal Mean
TF_1   ME          CTCTGTAAGTCAGGGTGACTCGAGCGGATCACCTGCCTGTGTGAAATTGTTATCCGCTCT  ?
TF_1   ME          AGGTGGGTCCAATTATCCGATCTCACGTCGACCTTCCTGTGTGAAATTGTTATCCGCTCT  ?
...
TF_33  ME          GCCAGTTTAGGTGGCGCCCGGAACCCTTAACCCATCCTGTGTGAAATTGTTATCCGCTCT  ?
TF_33  ME          CATGTAGAGCCCTAAAACTGGGACTAAGCCGACCTCCTGTGTGAAATTGTTATCCGCTCT  ?
TF_34  HK          CATGTAAGAAGTTATCCTGGCTGTCTAATGCCGCTCCTGTGTGAAATTGTTATCCGCTCT  ?
TF_34  HK          TTGCTCATCAGATCGCGCTAACAGGCTTTCACTTACCTGTGTGAAATTGTTATCCGCTCT  ?
...
TF_66  HK          CATGTAAGAAGTTATCCTGGCTGTCTAATGCCGCTCCTGTGTGAAATTGTTATCCGCTCT  ?
TF_66  HK          TTGCTCATCAGATCGCGCTAACAGGCTTTCACTTACCTGTGTGAAATTGTTATCCGCTCT  ?
Changes made on 9/13/2010 to allow for faster upload:
- Upon submission, participants must replace the "?" sign in file DREAM5_PBM_TeamName_Predictions.txt with their predictions of the "Signal Mean" for the probe sequence in the corresponding row.
- Upon submission, participants must delete the third column (Sequence) from file DREAM5_PBM_TeamName_Predictions.txt, maintaining the original order of the file.
- Columns must be tab-separated.
- Check the format of your file with this validation script (updated September 15th, 2010).
- Zip the file before submitting.
- Name the file DREAM5_PBM_TeamName_Predictions.zip, replacing "TeamName" with the name of the team with which you registered for the challenge, and upload it in the upload site.
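The formatting steps listed above can be sketched as a small script. This is an illustration under stated assumptions, not a replacement for the validation script: it assumes the template has a single header line and four tab-separated columns (ID, array type, sequence, "?"), keeps the header while dropping its third column, and writes predictions with four decimal places. The function name and the number format are hypothetical choices.

```python
import csv
import os
import zipfile

def write_submission(template_path, predictions, out_txt, out_zip):
    """Fill the template with predictions, drop the Sequence column, and zip the result.

    predictions: list of numbers, one per data row, in the template's row order.
    """
    with open(template_path, newline="") as src:
        reader = csv.reader(src, delimiter="\t")
        header = next(reader)  # assumed: exactly one header line
        rows = [row for row in reader if row]
    if len(rows) != len(predictions):
        raise ValueError("need exactly one prediction per template row")

    with open(out_txt, "w", newline="") as dst:
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        writer.writerow(header[:2] + header[3:])  # remove the third column
        for row, pred in zip(rows, predictions):  # original row order is preserved
            writer.writerow([row[0], row[1], f"{pred:.4f}"])

    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(out_txt, arcname=os.path.basename(out_txt))
```

A call such as `write_submission("DREAM5_PBM_TeamName_Predictions.txt", preds, "out.txt", "DREAM5_PBM_TeamName_Predictions.zip")` would produce the zipped file; the output should still be checked with the validation script before upload.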
(2) A short (one to two page) write-up explaining the methodology used to generate your predictions: position weight matrix, dinucleotide, or “complex” (other) model. Submit the write-up as the file
replacing TeamName with the name of your team and the file extension (ext) with your choice of txt, doc, rtf, or pdf.
Model predictions will be evaluated using the held out data by Pearson/Spearman correlation, Precision/Recall-like analysis of the top scoring n 8-mers (where n varies from 1 to the number of possible 8-mers) and Root Mean Squared Error (RMSE), following (Chen et al. 2007) and (Alleyne et al. 2009).
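For self-assessment on the 20 training TFs, the probe-level measures named above can be computed as in the following sketch. It covers Pearson correlation, Spearman correlation (Pearson on average ranks, with ties sharing a rank), and RMSE in plain Python; it does not reproduce the 8-mer precision/recall analysis or any other detail of the official scoring, and the caller is assumed to have removed flagged spots first.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

def rmse(x, y):
    """Root mean squared error between observed and predicted values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))
```

Because RMSE depends on the absolute scale of the predictions while the correlations do not, a model can rank probes well and still score poorly on RMSE if its outputs are not calibrated to the intensity scale of the target array.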
“Name that factor” (optional) - For each of the TFs (TF_1 to TF_66) for which binding predictions were requested in the main part of the challenge, provide the actual name of the transcription factor. Please use official Mouse Genome Informatics (MGI) website symbols. For example, the E2F transcription factor 2 should be notated as E2f2. Submit your predictions using the following provided template file:
ID     Transcription Factor (MGI Id)
TF_1   ?
TF_2   ?
TF_3   ?
TF_4   ?
...
TF_63  ?
TF_64  ?
TF_65  ?
TF_66  ?
Upon submission, replace "TeamName" in the filename with the name of your team. Replace the "?" signs by your prediction of the corresponding TF. If you don't have a prediction, leave the "?" sign in place.
Alleyne TM, Peña-Castillo L, Badis G, Talukder S, Berger MF, Gehrke AR, Philippakis AA, Bulyk ML, Morris QD, Hughes TR. Predicting the binding preference of transcription factors to individual DNA k-mers. Bioinformatics. 2009 Apr 15;25(8):1012-8. Epub 2008 Dec 16.
Badis G, Berger MF, Philippakis AA, Talukder S, Gehrke AR, Jaeger SA, Chan ET, Metzler G, Vedenko A, Chen X, Kuznetsov H, Wang CF, Coburn D, Newburger DE, Morris Q, Hughes TR, Bulyk ML. Diversity and complexity in DNA recognition by transcription factors. Science. 2009 Jun 26;324(5935):1720-3. Epub 2009 May 14.
Benos PV, Bulyk ML, Stormo GD. Additivity in protein-DNA interactions: how good an approximation is it? Nucleic Acids Res. 2002 Oct 15;30(20):4442-51.
Berger MF, Bulyk ML. Protein binding microarrays (PBMs) for rapid, high-throughput characterization of the sequence specificities of DNA binding proteins. Methods Mol Biol. 2006;338:245-60.
Berger MF, Bulyk ML. Universal protein-binding microarrays for the comprehensive characterization of the DNA-binding specificities of transcription factors. Nat Protoc. 2009;4(3):393-411.
Chen X, Hughes TR, Morris Q. RankMotif++: a motif-search algorithm that accounts for relative ranks of K-mers in binding transcription factors. Bioinformatics. 2007 Jul 1;23(13):i72-9.
He X, Chen CC, Hong F, Fang F, Sinha S, Ng HH, Zhong S. A biophysical model for analysis of transcription factor interaction and binding site arrangement from genome-wide binding data. PLoS One. 2009 Dec 1;4(12):e8155.
Maerkl SJ, Quake SR. A systems approach to measuring the binding energy landscapes of transcription factors. Science. 2007 Jan 12;315(5809):233-7.
Mintseris J, Eisen MB. Design of a combinatorial DNA microarray for protein-DNA interaction studies. BMC Bioinformatics. 2006 Oct 3;7:429.
Philippakis AA, Qureshi AM, Berger MF, Bulyk ML. Design of compact, universal DNA microarrays for protein binding microarray experiments. J Comput Biol. 2008 Sep;15(7):655-65.
Sharon E, Lubliner S, Segal E. A feature-based approach to modeling protein-DNA interactions. PLoS Comput Biol. 2008 Aug 22;4(8):e1000154.
Zhao X, Huang H, Speed TP. Finding short DNA motifs using permuted Markov models. J Comput Biol. 2005 Jul-Aug;12(6):894-906.
The challenge was provided by Matthew T. Weirauch and Timothy R. Hughes, from the Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto. The challenge was designed in collaboration with Robert Prill and Gustavo Stolovitzky from the IBM T.J. Watson Research Center in New York, and Julio Saez-Rodriguez from Harvard Medical School and MIT.
Download Data (Registration Required).
Don't hesitate to post a question in the DREAM Discussion board if you need any clarification or have a suggestion about this challenge.