Transcription-Factor/DNA-Motif Recognition Challenge

DREAM5, Challenge 2

Introduction

Transcription factors (TFs) control the expression of genes through sequence-specific interactions with genomic DNA. Different TFs bind preferentially to different sequences, with the majority recognizing short (6-12 base), degenerate ‘motifs’. Modeling the sequence specificities of TFs is a central problem in understanding the function and evolution of the genome, because many types of genomic analyses involve scanning for potential TF binding sites. Models of TF binding specificity are also important for understanding the function and evolution of the TFs themselves.

Ideally, models of TF sequence binding specificity should predict the relative affinity (e.g. dissociation constant) to different individual sequences, and/or the probability of occupancy at any position in the genome. Currently, the major paradigm in modeling TF sequence specificity is the Position Weight Matrix (PWM) model. However, it is increasingly recognized that shortcomings of PWMs, such as their inability to model gaps, to capture dependencies between the residues in the binding site, or to account for the fact that TFs can have more than one DNA-binding interface, can make them inaccurate (Benos et al. 2002; Badis et al. 2009; Maerkl and Quake 2007). Alternative models that address some of the shortcomings of PWMs have been developed (Sharon et al. 2008; He et al. 2009; Zhao et al. 2005), but their relative efficacies have not been directly compared.
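As a rough illustration of the PWM paradigm and its independence assumption, the sketch below builds a log-odds matrix from a few aligned sites and scores a sequence by summing per-position contributions. The function names, the pseudocount, and the uniform background of 0.25 are illustrative assumptions, not part of the challenge materials.

```python
import math

BASES = "ACGT"

def pwm_from_sites(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict per position) from equal-length aligned sites."""
    width = len(sites[0])
    pwm = []
    for pos in range(width):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pwm

def score_window(pwm, window):
    """Additive score: each position contributes independently of the others."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_score(pwm, sequence):
    """Highest-scoring window of the PWM's width anywhere in the sequence."""
    width = len(pwm)
    return max(score_window(pwm, sequence[i:i + width])
               for i in range(len(sequence) - width + 1))
```

Because the score is a plain sum over positions, a model of this form cannot represent gaps or dependencies between positions, which is the limitation discussed above.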

A major difficulty in studying TF DNA-binding specificity has been the scarcity of data. The process of training and testing models benefits from a large number of unbiased data points. In the case of TF binding models, the required data are the relative preferences of a TF for a large number of individual sequences. Recently, Protein Binding Microarrays (PBMs) have been developed for the purpose of determining TF sequence preferences (Berger et al. 2006). The resulting data provide a quantitative score representing the relative binding affinity of a given TF to the sequence of each probe contained on the array. PBM data have produced some of the strongest evidence for the inadequacy of PWM models (Badis et al. 2009), and also provide extensive training/test data.

Given the output of probe intensities of one PBM array type, this challenge consists of predicting the probe intensities of a second array type. Each array consists of ~41,000 60-base probe sequences (each containing 35 unique bases); the two array types have completely different probe sequences. Contestants may base their predictions on any type of model (e.g. position weight matrices), but the type of model and its details must be specified in order to correctly categorize the method for evaluation purposes.

Background

Modeling transcription factor DNA-binding activities is an active field, and there are many open questions. PBMs provide an opportunity to evaluate motif models and other representations. Each PBM is designed using de Bruijn sequences, so that each array contains all possible 10-mers and 32 copies of every non-palindromic 8-mer, offering an unbiased survey of TF binding preferences. The two array designs (“ME” and “HK”) were constructed using different de Bruijn sequences, so the two arrays have completely different probes from each other.
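The defining property of a de Bruijn sequence (every k-mer occurs exactly once in the cyclic sequence) can be sketched with the standard Lyndon-word construction. This is only an illustration of the property; it does not attempt to reproduce the procedure used to design the actual ME and HK arrays, and the function names are hypothetical.

```python
def de_bruijn(alphabet, n):
    """Return a cyclic de Bruijn sequence of order n over the given alphabet."""
    k = len(alphabet)
    a = [0] * (k * n)
    sequence = []

    def db(t, p):
        # Recursively enumerate Lyndon words whose length divides n.
        if t > n:
            if n % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in sequence)

def linearize(cyclic, n):
    """Append the first n-1 symbols so every k-mer also appears in the linear string."""
    return cyclic + cyclic[:n - 1]

def kmers(sequence, n):
    """All overlapping n-mers of a linear sequence, in order."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]
```

For DNA with n = 10 the cyclic sequence has 4^10 = 1,048,576 bases; an array design would then have to cut such a sequence into overlapping probe-length segments.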

Current approaches to modeling PBM data first break the set of 35-mer probe signals into multiple measurements for each 8-mer, and summarize these measurements using either Z-scores (significance estimates based on the normal distribution of intensities) or E-scores (rank-based, non-parametric statistics created using a method similar to the Wilcoxon rank-sum test). The resulting data provides a “lookup table” summarizing the relative affinity of a given TF to each possible 8-base sequence, and is often converted to a position weight matrix by (for example) aligning all 8-mer sequences with significant scores. More advanced methods have been proposed for modeling PBM data, such as Seed and Wobble (Berger et al. 2006) and RankMotif++ (Chen et al. 2007). However, it is still an open question how to best use PBM data to model the binding preferences of a given TF.
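The first step described above, collapsing probe-level signals into one value per 8-mer, might look roughly like the following. This is a simplified sketch: it assumes the variable region is the first 35 bases of each probe (consistent with the sample rows in this document, where the last 25 bases are identical across probes), merges each k-mer with its reverse complement, and uses a plain standardization of the median intensities. The published Z-score and E-score definitions differ in detail, and all function names here are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median, pstdev

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Represent a k-mer and its reverse complement by one shared key."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def kmer_medians(probes, k=8, variable_length=35):
    """probes: iterable of (sequence, intensity). Returns {k-mer: median intensity}."""
    by_kmer = defaultdict(list)
    for sequence, intensity in probes:
        variable = sequence[:variable_length]
        # Count each k-mer at most once per probe.
        seen = {canonical(variable[i:i + k])
                for i in range(len(variable) - k + 1)}
        for kmer in seen:
            by_kmer[kmer].append(intensity)
    return {kmer: median(values) for kmer, values in by_kmer.items()}

def z_scores(medians):
    """Standardize the per-k-mer medians (assumption: simple mean/std scaling)."""
    values = list(medians.values())
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        raise ValueError("all k-mer medians are identical")
    return {kmer: (value - mu) / sd for kmer, value in medians.items()}
```

The resulting dictionary plays the role of the "lookup table" mentioned above; flagged probes should be removed before it is built.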

The Challenge

The dataset for this challenge describes the binding preferences of 86 mouse TFs (representing a wide range of TF families) in the form of probe intensity signals. For 20 TFs, data (the training set) is provided from both HK and ME array types, for “practice” and method calibration. The challenge consists of predicting the signal intensities for the remaining 66 TFs. For 33 TFs, data will be provided from array type “ME”; data for the other 33 TFs will be provided for array type “HK”. Released data is to be used for model learning, and unreleased data will be used for evaluation purposes.

The Data

Three data sets will be provided in tabular form with tab-separated columns. In all cases, data columns correspond to output from the software package GenePix Pro version 6.0: “signal” corresponds to feature pixels, “background” corresponds to background pixels and "Flag" is a binary probe-quality field. For the "Flag" field, 0 means the probe quality is good whereas 1 means the probe was flagged as bad due to dust specks, scratches, or other imperfections. The data files are:

  ID     Array  Sequence                        Signal  Background  Signal    Background  Signal   Background   Flag
         Type                                   Mean    Mean        Median    Median      Std      Std
         
  Egr2   ME     CATGTAAGAAGTTATCCTGGCTGTCTAATG  15926   1030.75     18592.5   273.00      5495     1635.32      0
                CCGCTCCTGTGTGAAATTGTTATCCGCTCT
         
  Egr2   ME     TTGCTCATCAGATCGCGCTAACAGGCTTTC  17487   760.59      20249.0   265.50      7285     1077.09      0
                ACTTACCTGTGTGAAATTGTTATCCGCTCT
  ...    

  Egr2   HK     GCCAGTTTAGGTGGCGCCCGGAACCCTTAA  2972.4  574.22      2928.00   391.50      799.5    515.27       1
                CCCATCCTGTGTGAAATTGTTATCCGCTCT

  Egr2   HK     CATGTAGAGCCCTAAAACTGGGACTAAGCC  3552.3  608.46      3697.00   352.00      869.8    642.76       0
                GACCTCCTGTGTGAAATTGTTATCCGCTCT
  ...    

  Foxp2  ME     CATGTAAGAAGTTATCCTGGCTGTCTAATG  27336   3457.05     27283.0   1004.00     5440     5076.35      0
                CCGCTCCTGTGTGAAATTGTTATCCGCTCT

  Foxp2  ME     TTGCTCATCAGATCGCGCTAACAGGCTTTC  54822   6635.14     56181.0   1306.00     8275     10833.50     0
                ACTTACCTGTGTGAAATTGTTATCCGCTCT
  ...

  Foxp2  HK     GCCAGTTTAGGTGGCGCCCGGAACCCTTAA  36935   8580.54     37738.0   3724.00     2782     10308.83     0
                CCCATCCTGTGTGAAATTGTTATCCGCTCT

  Foxp2  HK     CATGTAGAGCCCTAAAACTGGGACTAAGCC  33758   6616.34     34661.0   2466.50     4111     8178.01      0
                GACCTCCTGTGTGAAATTGTTATCCGCTCT
  ...

  ID     Array  Sequence                        Signal  Background  Signal    Background  Signal   Background   Flag
         Type                                   Mean    Mean        Median    Median      Std      Std

  TF_1   HK     CTCTGTAAGTCAGGGTGACTCGAGCGGATC  4572.9  707.33      4674.00   406.00      987.4    715.31       0
                ACCTGCCTGTGTGAAATTGTTATCCGCTCT

  TF_1   HK     AGGTGGGTCCAATTATCCGATCTCACGTCG  4989.0  802.11      5126.00   473.00      810.6    800.87       0
                ACCTTCCTGTGTGAAATTGTTATCCGCTCT	
  ...

  TF_33  HK     GCCAGTTTAGGTGGCGCCCGGAACCCTTAA  913.42  599.35      830.00    383.00      378.0    538.39       0
                CCCATCCTGTGTGAAATTGTTATCCGCTCT

  TF_33  HK     CATGTAGAGCCCTAAAACTGGGACTAAGCC  849.74  336.91      790.00    193.00      305.7    397.59       1
                GACCTCCTGTGTGAAATTGTTATCCGCTCT

  TF_34  ME     CATGTAAGAAGTTATCCTGGCTGTCTAATG  12895   2739.14     13197.0   1159.00     1696     2893.35      0
                CCGCTCCTGTGTGAAATTGTTATCCGCTCT

  TF_34  ME     TTGCTCATCAGATCGCGCTAACAGGCTTTC  16045   2054.40     16319.0   709.00      3558     2556.29      0
                ACTTACCTGTGTGAAATTGTTATCCGCTCT
  ...

  TF_66  ME     CATGTAAGAAGTTATCCTGGCTGTCTAATG  3674.6  347.45      3974.50   147.00      1008     460.29       0
                CCGCTCCTGTGTGAAATTGTTATCCGCTCT

  TF_66  ME     TTGCTCATCAGATCGCGCTAACAGGCTTTC  3793.1  390.47      4073.00   150.00      1133     557.11       0
                ACTTACCTGTGTGAAATTGTTATCCGCTCT

Note about bad microarray spots

Microarray flags (i.e. bad spots such as dust or scratches) were omitted from the originally posted data files (DREAM5_PBM_Data_TrainingSet.txt and DREAM5_PBM_Data_Needed_For_Predictions.txt). Files downloaded prior to June 8, 2010 do not contain a "Flag" column; data files with the flag column were posted on June 8, 2010. Typically far fewer than 1% of spots are flagged, but since these spots are suspect and may have aberrantly high or low intensity, they should be masked in training data. Flagged spots will also be excluded from the evaluations.
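Reading a data file and masking flagged spots could be done along these lines. The sketch assumes a single header line and ten tab-separated columns in the order shown in the tables above; the header names in SAMPLE and the dictionary keys are hypothetical, so check them against the downloaded files. The two sample rows are the Egr2 HK rows shown earlier.

```python
import csv
import io

# Assumed column order, taken from the tables in this document.
COLUMNS = ["id", "array_type", "sequence", "signal_mean", "background_mean",
           "signal_median", "background_median", "signal_std",
           "background_std", "flag"]

def read_pbm_rows(handle, skip_header=True):
    """Yield one dict per probe row from a tab-separated PBM data file."""
    reader = csv.reader(handle, delimiter="\t")
    if skip_header:
        next(reader, None)
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        row = dict(zip(COLUMNS, fields))
        for name in COLUMNS[3:9]:
            row[name] = float(row[name])
        row["flag"] = int(row["flag"])
        yield row

def unflagged(rows):
    """Keep only probes whose Flag is 0 (good quality)."""
    return [row for row in rows if row["flag"] == 0]

# Small in-memory sample; the header names here are placeholders.
SAMPLE = "\n".join("\t".join(fields) for fields in [
    COLUMNS,
    ["Egr2", "HK",
     "GCCAGTTTAGGTGGCGCCCGGAACCCTTAACCCATCCTGTGTGAAATTGTTATCCGCTCT",
     "2972.4", "574.22", "2928.00", "391.50", "799.5", "515.27", "1"],
    ["Egr2", "HK",
     "CATGTAGAGCCCTAAAACTGGGACTAAGCCGACCTCCTGTGTGAAATTGTTATCCGCTCT",
     "3552.3", "608.46", "3697.00", "352.00", "869.8", "642.76", "0"],
]) + "\n"

good_rows = unflagged(read_pbm_rows(io.StringIO(SAMPLE)))
```

With a real file, the StringIO object would be replaced by an open file handle.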

Important information regarding measurements and PBM array types

Submission

Participants are required to submit 2 files:

(1) For each transcription factor TF_1 to TF_33, please submit your predictions of “Signal Mean” for the probe sequences of array type ME. Likewise, for each transcription factor TF_34 to TF_66 please submit your predictions of “Signal Mean” for the probe sequences in the array type HK. Submit your predictions using the template file DREAM5_PBM_TeamName_Predictions.txt. The contents of this file are as follows:

  ID     Array  Sequence                                                      Signal
         Type                                                                 Mean

  TF_1   ME     CTCTGTAAGTCAGGGTGACTCGAGCGGATCACCTGCCTGTGTGAAATTGTTATCCGCTCT  ?
  TF_1   ME     AGGTGGGTCCAATTATCCGATCTCACGTCGACCTTCCTGTGTGAAATTGTTATCCGCTCT  ?	
  ...
  TF_33  ME     GCCAGTTTAGGTGGCGCCCGGAACCCTTAACCCATCCTGTGTGAAATTGTTATCCGCTCT  ?	
  TF_33  ME     CATGTAGAGCCCTAAAACTGGGACTAAGCCGACCTCCTGTGTGAAATTGTTATCCGCTCT  ?
  TF_34  HK     CATGTAAGAAGTTATCCTGGCTGTCTAATGCCGCTCCTGTGTGAAATTGTTATCCGCTCT  ?
  TF_34  HK     TTGCTCATCAGATCGCGCTAACAGGCTTTCACTTACCTGTGTGAAATTGTTATCCGCTCT  ?	
  ...
  TF_66  HK     CATGTAAGAAGTTATCCTGGCTGTCTAATGCCGCTCCTGTGTGAAATTGTTATCCGCTCT  ?	
  TF_66  HK     TTGCTCATCAGATCGCGCTAACAGGCTTTCACTTACCTGTGTGAAATTGTTATCCGCTCT  ?

Changes made on 9/13/2010 to allow for faster upload:

(2) A short (one to two page) write-up explaining the methodology used to generate your predictions: position weight matrix, dinucleotide, or “complex” (other) model. Submit the write-up as the file

DREAM5_PBM_TeamName_Writeup.ext

replacing TeamName with the name of your team and the file extension (ext) with your choice of txt, doc, rtf, or pdf.

Scoring Metrics

Model predictions will be evaluated against the held-out data using Pearson and Spearman correlation, a Precision/Recall-like analysis of the top-scoring n 8-mers (where n varies from 1 to the number of possible 8-mers), and Root Mean Squared Error (RMSE), following Chen et al. (2007) and Alleyne et al. (2009).
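For local calibration on the training set, the probe-level metrics named above can be approximated as follows. This is not the organizers' scoring script: it omits the 8-mer precision/recall analysis and any normalization they may apply, and the function names are hypothetical. Flagged probes should be removed before scoring.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for index in order[i:j + 1]:
            ranks[index] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    squared = sum((a - b) ** 2 for a, b in zip(observed, predicted))
    return math.sqrt(squared / len(observed))
```

Correlation measures are insensitive to the overall scale of the predictions, whereas RMSE is not, so predictions that rank probes well can still score poorly on RMSE if their intensity scale is off.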

Bonus Round

“Name that factor” (optional) - For each of the TFs (TF_1 to TF_66) for which binding predictions were requested in the main part of the challenge, provide the actual name of the transcription factor. Please use official Mouse Genome Informatics (MGI) website symbols. For example, the E2F transcription factor 2 should be written as E2f2. Submit your predictions using the following template file:

DREAM5_PBM_TeamName_BonusRoundAnswers.txt

  ID     Transcription Factor (MGI Id)
  TF_1   ?
  TF_2   ?
  TF_3   ?
  TF_4   ?
  ...
  TF_63  ?
  TF_64  ?
  TF_65  ?
  TF_66  ?

Upon submission, replace "TeamName" in the filename with the name of your team. Replace each "?" with your prediction of the corresponding TF. If you don't have a prediction, leave the "?" in place.

References

Alleyne TM, Peña-Castillo L, Badis G, Talukder S, Berger MF, Gehrke AR, Philippakis AA, Bulyk ML, Morris QD, Hughes TR. Predicting the binding preference of transcription factors to individual DNA k-mers. Bioinformatics. 2009 Apr 15;25(8):1012-8. Epub 2008 Dec 16.

Badis G, Berger MF, Philippakis AA, Talukder S, Gehrke AR, Jaeger SA, Chan ET, Metzler G, Vedenko A, Chen X, Kuznetsov H, Wang CF, Coburn D, Newburger DE, Morris Q, Hughes TR, Bulyk ML. Diversity and complexity in DNA recognition by transcription factors. Science. 2009 Jun 26;324(5935):1720-3. Epub 2009 May 14.

Benos PV, Bulyk ML, Stormo GD. Additivity in protein-DNA interactions: how good an approximation is it? Nucleic Acids Res. 2002 Oct 15;30(20):4442-51.

Berger MF, Bulyk ML. Protein binding microarrays (PBMs) for rapid, high-throughput characterization of the sequence specificities of DNA binding proteins. Methods Mol Biol. 2006;338:245-60.

Berger MF, Bulyk ML. Universal protein-binding microarrays for the comprehensive characterization of the DNA-binding specificities of transcription factors. Nat Protoc. 2009;4(3):393-411.

Chen X, Hughes TR, Morris Q. RankMotif++: a motif-search algorithm that accounts for relative ranks of K-mers in binding transcription factors. Bioinformatics. 2007 Jul 1;23(13):i72-9.

He X, Chen CC, Hong F, Fang F, Sinha S, Ng HH, Zhong S. A biophysical model for analysis of transcription factor interaction and binding site arrangement from genome-wide binding data. PLoS One. 2009 Dec 1;4(12):e8155.

Maerkl SJ, Quake SR. A systems approach to measuring the binding energy landscapes of transcription factors. Science. 2007 Jan 12;315(5809):233-7.

Mintseris J, Eisen MB. Design of a combinatorial DNA microarray for protein-DNA interaction studies. BMC Bioinformatics. 2006 Oct 3;7:429.

Philippakis AA, Qureshi AM, Berger MF, Bulyk ML. Design of compact, universal DNA microarrays for protein binding microarray experiments. J Comput Biol. 2008 Sep;15(7):655-65.

Sharon E, Lubliner S, Segal E. A feature-based approach to modeling protein-DNA interactions. PLoS Comput Biol. 2008 Aug 22;4(8):e1000154.

Zhao X, Huang H, Speed TP. Finding short DNA motifs using permuted Markov models. J Comput Biol. 2005 Jul-Aug;12(6):894-906.

Authors

The challenge was provided by Matthew T. Weirauch and Timothy R. Hughes, from the Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto. The challenge was designed in collaboration with Robert Prill and Gustavo Stolovitzky from the IBM T.J. Watson Research Center in New York, and Julio Saez-Rodriguez from Harvard Medical School and MIT.


Download

Download Data (Registration Required).

Feedback

Don't hesitate to post a question in the DREAM Discussion board if you need any clarification or have a suggestion about this challenge.

Retrieved from "http://wiki.c2b2.columbia.edu/dream/index.php/D5c2"

This page has been accessed 20,575 times. This page was last modified 14:19, 23 December 2010.
