Peptide Recognition Domain (PRD) Specificity Prediction
DREAM4, Challenge 1
Synopsis
Many important protein-protein interactions are mediated by peptide recognition domains (PRDs), which bind short linear sequence motifs in other proteins. For example, SH3 domains typically recognize proline-rich motifs, PDZ domains recognize hydrophobic C-terminal tails, and kinases recognize short sequence regions around a phosphorylatable residue [1].
Given the sequence of the domains, the challenge consists of predicting a position weight matrix (PWM) that describes the specificity profile of each of the given domains to their target peptides. Any publicly accessible peptide specificity information available for the domain may be used.
Background
Ideally, PRD specificity could be predicted directly from the sequence of the domain itself. This would enable the prediction of protein-protein interaction networks directly from the genome sequence.
The specificities of selected human SH3, synthetic PDZ, and kinase PRDs were experimentally mapped using phage display and combinatorial peptide libraries. The peptide libraries contain many short peptides with diverse sequences, around ten amino acids in length. The domain is used to select the peptides from the library that bind to it. The set of peptides that bind to a domain defines a short, linear sequence pattern that the domain is expected to recognize. This pattern can be represented probabilistically as a position weight matrix (PWM). The PWM representation implicitly assumes that the motif positions are independent. Although interactions between some positions may exist in certain motifs, they are neglected for this challenge.
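To make the PWM representation concrete, the following minimal sketch turns a set of aligned, equal-length peptides into per-position amino acid frequencies. The function name `pwm_from_peptides`, the optional `pseudocount` parameter, and the toy peptides are illustrative assumptions, not part of the challenge materials. Each column is computed on its own, which is exactly the independence assumption described above.

```python
from collections import Counter

# The 20 standard amino acids, alphabetical by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pwm_from_peptides(peptides, pseudocount=0.0):
    """Return {amino_acid: [p_1, ..., p_L]} of per-position frequencies.

    Assumes the peptides are already aligned and of equal length.
    """
    if not peptides:
        raise ValueError("need at least one peptide")
    length = len(peptides[0])
    for peptide in peptides:
        if len(peptide) != length:
            raise ValueError("peptides must all have the same length")
        if any(residue not in AMINO_ACIDS for residue in peptide):
            raise ValueError(f"non-standard residue in {peptide!r}")

    total = len(peptides) + pseudocount * len(AMINO_ACIDS)
    pwm = {aa: [0.0] * length for aa in AMINO_ACIDS}
    for pos in range(length):
        # Count residues at this position only; positions are independent.
        counts = Counter(peptide[pos] for peptide in peptides)
        for aa in AMINO_ACIDS:
            pwm[aa][pos] = (counts[aa] + pseudocount) / total
    return pwm


if __name__ == "__main__":
    # Hypothetical toy peptides, for illustration only.
    toy = pwm_from_peptides(["PPLP", "PALP", "APLP", "PPLP"])
    print(toy["P"])
```

A nonzero pseudocount keeps amino acids that were never observed at a position from receiving a probability of exactly zero, which can matter when only a few binding peptides are available.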
Publicly available information about the domain family that may be useful for prediction includes known ligands of members of the domain family from the literature or databases like DOMINO [2] or PDZBase [3] and structures from the PDB [4].
The Challenge
Peptides bound by SH3, PDZ, and kinase PRDs were experimentally identified. These data constitute an unpublished "gold standard" for the binding specificity of the selected PRDs.
Given the sequence of the domains, the challenge consists of predicting a position weight matrix (PWM) that describes the specificity profile of each of the given domains to their target peptides. Any publicly accessible peptide specificity information available for the domain may be used.
Data
- DREAM4_DomainSequences.txt contains 5 human SH3 domain sequences, 3 serine/threonine kinase sequences and 5 synthetic PDZ domain sequences modeled on Erbin (Erbb2 interacting protein).
Submission
Using the provided tab delimited template file
- DREAM4_TeamName_PWM.txt
and keeping the formatting of this file, submit a ten-column PWM for each domain. An example PWM is illustrated below. Each row corresponds to an amino acid and each column to a peptide position; each entry is the probability that the given amino acid is found at that position. Each of the ten columns must sum to 1.0. (Note that the amino acids are ordered alphabetically by IUPAC single letter code. Please keep this template format.)
A 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
C 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
D 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
E 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
F 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
G 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
H 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
I 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
K 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
L 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
M 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
N 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
P 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
Q 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
R 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
S 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
T 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
V 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
W 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
Y 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05
- If a column is not predicted, enter 0.05 for all rows in that column, signifying uniform background frequency.
- If a PWM is not predicted, leave 0.05 for all columns and all rows for that PWM.
- All PWM predictions must be placed in one text file according to the template, keeping the order of the template file as it is.
- A best performer will be identified for each of the three domain types (SH3, PDZ, and kinase). You must submit predictions for at least one of the domain types. All the instances of the PRD in a given domain type must be predicted in order for your submission to be scored in that domain type.
- Replace TeamName in the filename "DREAM4_TeamName_PRD.txt" with the name of your team before submitting.
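Before submitting, it may help to check each PWM against the rules listed above. The sketch below is one possible check, not an official validator. It assumes a PWM has already been loaded into a dictionary mapping each amino acid letter to a list of ten probabilities. The names `uniform_pwm` and `check_pwm` and the tolerance `tol` are hypothetical, and parsing of the template file is left out because its exact layout is defined by the file itself.

```python
import math

# Row order required by the template: alphabetical by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
N_COLUMNS = 10


def uniform_pwm():
    """PWM for a domain that is not predicted: 0.05 in every cell."""
    return {aa: [1.0 / len(AMINO_ACIDS)] * N_COLUMNS for aa in AMINO_ACIDS}


def check_pwm(pwm, tol=1e-6):
    """Return a list of problems; an empty list means the PWM passes."""
    problems = []
    if list(pwm) != list(AMINO_ACIDS):
        return ["rows must be the 20 amino acids in alphabetical order"]
    for aa in AMINO_ACIDS:
        if len(pwm[aa]) != N_COLUMNS:
            problems.append(
                f"row {aa} has {len(pwm[aa])} columns, expected {N_COLUMNS}"
            )
        if any(p < 0.0 or p > 1.0 for p in pwm[aa]):
            problems.append(f"row {aa} has a value outside [0, 1]")
    if problems:
        return problems
    # Every column is a probability distribution over the 20 amino acids.
    for col in range(N_COLUMNS):
        total = sum(pwm[aa][col] for aa in AMINO_ACIDS)
        if not math.isclose(total, 1.0, abs_tol=tol):
            problems.append(
                f"column {col + 1} sums to {total:.4f}, expected 1.0"
            )
    return problems


if __name__ == "__main__":
    print(check_pwm(uniform_pwm()))
```

A small tolerance is used because values rounded for a text file rarely sum to exactly 1.0 in floating point.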
Scoring Metrics
The submitted PWM predictions will be judged exclusively by similarity to the experimentally mapped PWM using the distance induced by the Frobenius Norm (http://mathworld.wolfram.com/FrobeniusNorm.html).
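For reference, the distance induced by the Frobenius norm between two matrices of the same shape is the square root of the sum of squared entry-wise differences. A minimal sketch follows, assuming each PWM is held as a list of 20 rows with ten values each. The function name `frobenius_distance` is illustrative, and any further normalization or transformation applied by the organizers is not described here.

```python
import math


def frobenius_distance(pwm_a, pwm_b):
    """Square root of the sum of squared entry-wise differences."""
    if len(pwm_a) != len(pwm_b):
        raise ValueError("PWMs must have the same number of rows")
    total = 0.0
    for row_a, row_b in zip(pwm_a, pwm_b):
        if len(row_a) != len(row_b):
            raise ValueError("PWMs must have the same number of columns")
        total += sum((a - b) ** 2 for a, b in zip(row_a, row_b))
    return math.sqrt(total)


if __name__ == "__main__":
    uniform = [[0.05] * 10 for _ in range(20)]
    print(frobenius_distance(uniform, uniform))  # identical matrices give 0.0
```

A smaller distance means the predicted PWM is closer to the experimentally mapped one.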
Domain specific notes:
- Kinase: Column 6 in the PWM must correspond to the phosphorylatable S/T residue in the peptide that binds to the kinase.
- PDZ: Column 10 in the PWM must correspond to C-terminus of the peptide that binds to the PDZ domain.
- SH3: No anchor position in the PWM is defined. Every possible alignment of the predicted SH3 peptide specificity PWM with the experimentally mapped SH3 peptide specificity PWM of length >=5 will be tried and the final score will be equal to the highest similarity found.
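The SH3 alignment search can be pictured with the sketch below. It slides the predicted PWM against the experimental one, keeps only offsets where at least five columns overlap, and reports the offset with the smallest distance. Several details are assumptions made for illustration and are not taken from the scoring description: only the overlapping columns are compared, the Frobenius distance is divided by the square root of the overlap width so that different widths are comparable, and "highest similarity" is read as "smallest distance". The names `overlap_distance` and `best_alignment` are hypothetical.

```python
import math


def overlap_distance(pred, ref, offset):
    """Per-column RMS Frobenius distance over the overlapping columns.

    Column j of `pred` is compared with column j + offset of `ref`.
    Both PWMs are lists of rows with the same number of columns.
    """
    n_cols = len(ref[0])
    pairs = [(j, j + offset) for j in range(n_cols) if 0 <= j + offset < n_cols]
    squared = sum(
        (pred[r][j] - ref[r][k]) ** 2
        for r in range(len(ref))
        for j, k in pairs
    )
    return math.sqrt(squared / len(pairs)), len(pairs)


def best_alignment(pred, ref, min_overlap=5):
    """Return (offset, distance, overlap_width) with the smallest distance."""
    n_cols = len(ref[0])
    max_shift = n_cols - min_overlap
    best = None
    for offset in range(-max_shift, max_shift + 1):
        distance, width = overlap_distance(pred, ref, offset)
        # Strict comparison keeps the first offset found in case of ties.
        if best is None or distance < best[1]:
            best = (offset, distance, width)
    return best
```

One caveat of scoring only the overlap is that an alignment can look good simply because it pushes informative columns outside the compared window, so the actual scoring may treat non-overlapping columns differently.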
References
- [1] Pawson T, Nash P (2003) Assembly of cell regulatory systems through protein interaction domains. Science 300: 445-452.
- [2] Ceol A, Chatr-aryamontri A, Santonico E, Sacco R, Castagnoli L, et al. (2007) DOMINO: a database of domain-peptide interactions. Nucleic Acids Res 35: D557-560.
- [3] Beuming T, Skrabanek L, Niv MY, Mukherjee P, Weinstein H (2005) PDZBase: a protein-protein interaction database for PDZ-domains. Bioinformatics 21: 827-828.
- [4] Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, et al. (2000) The Protein Data Bank. Nucleic Acids Res 28: 235-242.
Authors
The challenge was provided by Gary Bader and Philip M. Kim, from the Terrence Donnelly Center for Cellular and Biomolecular Research, University of Toronto. Pre-publication data were generously provided by Sachdev Sidhu, Terrence Donnelly Center for Cellular and Biomolecular Research, University of Toronto, and Ben Turk, Department of Pharmacology, Yale University. The challenge was designed in collaboration with Robert Prill and Gustavo Stolovitzky from the IBM T.J. Watson Research Center in New York.
Download
Don't hesitate to post a question in the DREAM discussion board if you need any clarification on this challenge.
