Epitope-Antibody Recognition (EAR) Specificity Prediction
The datasets and gold standards for this challenge (The Epitope-Antibody Recognition Challenge) were generously provided prior to its publication and cannot be used for external publication without explicit permission from Dr. Hans-Juergen Thiesen (hj.thiesen_AT_gmx.de).
DREAM5, Challenge 1
Humoral immune responses are mediated through antibodies. About 10^10 to 10^12 different antigen binding sites, called paratopes, are generated by genomic recombination. These antibodies are capable of binding to a variety of structures ranging from small molecules to protein complexes, including any posttranslational modification thereof. When studying protein-antibody interactions, two types of epitopes (the regions that paratopes interact with) must be distinguished: i) conformational and ii) linear epitopes. All potential linear epitopes of a protein can be represented by short peptides derived from the primary amino acid sequence. These peptides can be synthesized and arrayed on solid supports, e.g. glass slides (see Lorenz et al., 2009). By incubating these peptide arrays with antibody mixtures such as human serum or plasma, one can determine which peptides interact with antibodies in a specific fashion.
The training set of this challenge comprises sequences of peptides that either bind intravenous immunoglobulin (IVIg) antibodies with high affinity/avidity (positive training set) or do not (negative training set). The challenge consists of determining for each peptide within the test set whether its reactivity with antibodies is strong or weak. Any approach that predicts the specificity scores of each peptide can in principle be applied for stratifying peptides presented in the test set into binders (to antibodies) and non-binders. Any publicly accessible information available for studying protein-protein-interactions as well as any approach enabling the determination of rule sets for predicting peptide-antibody affinities might be applied.
Antibody-protein interactions play a major role in various medical disciplines (infectious diseases, autoimmune diseases, oncology, vaccination and therapeutic interventions). Antibodies present in human blood interact with peptide sequences in a sequence-specific manner. Ideally, one specific antibody (monoclonal antibody) might exclusively bind one specific sequence. However, experimental data indicate that many antibodies bind to a panel of related or even distinct peptides and do so with different affinities. The open question is whether rules exist which enable the prediction of common peptide/epitope sequences that can be recognized by human antibodies.
The binding site covered by an antibody typically includes a stretch of 8 to 10 amino acids. If peptides of 15 amino acids in length are incubated with one monospecific antibody, that antibody will bind to its epitope independently of the physical position of the binding motif within the peptide. For a 10-residue motif, anything from a motif running from position 1 to position 10 up to a motif running from position 6 to position 15 would be possible. This uncertainty makes it difficult to determine consensus binding sites as well as meaningful position weight matrices (PWM). Individual amino acids within epitope binding sites may have different impact on antibody recognition not only due to the nature of the amino acids involved in binding (physicochemical properties) but also because of the specific position of the amino acid within the whole peptide sequence (context).
In the experimental work leading to this challenge, 75,534 peptides were incubated with commercially available intravenous immunoglobulin (IVIg) fractions. IVIg is a mixture of naturally occurring human antibodies isolated from up to 100,000 healthy individuals. From this dataset a high confidence negative and positive pool of peptides was determined. The training and test datasets for this Challenge were assembled from these peptide pools.
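The positional uncertainty of a binding motif within a 15-residue peptide can be made concrete with a minimal sketch. It enumerates every window of a given width and is only an illustration: the function name motif_windows and the fixed width of 10 are assumptions, and the example peptide is one of the sequences listed in the training set excerpt.

```python
def motif_windows(peptide, width=10):
    """Return (start, end, subsequence) for every window of the given width.

    Positions are 1-based and inclusive, matching the numbering in the text.
    """
    return [
        (i + 1, i + width, peptide[i:i + width])
        for i in range(len(peptide) - width + 1)
    ]


# A 15-residue peptide yields six candidate 10-residue motifs:
# positions 1-10, 2-11, ..., 6-15.
windows = motif_windows("HWNNIRMSYMHAHTF")
for start, end, motif in windows:
    print(start, end, motif)
```

Since the true motif length is only known to be roughly 8 to 10 residues, the same function could be called with several widths.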
From the collection of all the peptides incubated with human IVIg, a pool of 6,841 epitope containing peptide sequences reactive with human immunoglobulins was experimentally identified. This will be called the positive set. From the same original collection of peptides 20,437 peptides were identified that showed no antibody binding activity in any of the triplicate assays. This peptide set will be called the negative set. The training set was formed by picking 3,420 peptides from the positive set and 10,218 peptides from the negative set. The training set thus created contained 13,638 peptides and their respective binding reactivities. The test set was created by joining together the remaining 3,421 peptides from the positive set and the remaining 10,219 peptides from the negative set, for a total of 13,640 peptides.
The epitope-antibody recognition challenge consists of determining whether each peptide in the test set belongs to the positive or negative set. Any accessible specificity information on amino acids and protein-protein-interactions available in the scientific community can be used.
The teams participating in the challenge are strongly encouraged to participate in the following (optional) bonus round challenge. In the bonus round, the team submits a list of peptides that meets the following specifications. At least 1,000 peptides in this list are predicted to be as reactive as the peptides in the positive training set. At least 1,000 peptides in this list are predicted to have reactivity with antibodies that is not higher than that of the peptides in the negative training set. And at least 1,000 peptides in the list are predicted to have reactivity values in between those of the positive and negative sets. No submitted peptide sequence may share a stretch of more than three amino acids with any of the amino acid sequences supplied in the training or test set. Furthermore, the identity between any peptide sequence supplied in the training set and any predicted peptide sequence should not exceed 5 amino acids within a stretch of 11 amino acid positions.
The list of peptide predictions of the best performing teams will be subsequently evaluated experimentally to verify that the predicted reactivities are indeed in the predicted range. This bonus round, therefore, will assess de-novo predictions of epitope/antibody interactions.
The training set file contains a tab-separated two-column table. The first column contains the peptide sequences. Most of these sequences are 15 amino acids long, but there are also some other sequence lengths (such as several of 13 and a few of 16, 18, and 21 amino acids). The second column contains a measure of the reactivity of the peptide to the IVIg antibodies. The data, sorted in descending order according to the second column, is represented below:
HWNNIRMSYMHAHTF    65423
DDWYLRYGIMNANFS    65388
...
KVLLVQHQRVHSEEK    10019
INWPGLITNWSPQPF    10010
GLLLLTLSVLLAAGP    10007
SAKDSEHNEKYEDTF    941
MEGRSRDGGLRFGEM    755
...
EEEYEEEGEEEGEKE    1
QVATHVKINVQMHLG    1
The second column ranges from 1 to 65,423 (covering nearly all of the possible dynamic range of 1 to 65,536 of the original peptide microarray signal intensities). The peptides whose signals range from 10,000 to 65,536 were deemed to belong to the pool of peptides reacting with the antibodies, and are located in the first 3,420 rows. At the other extreme, the peptides whose signal lies between 1 and 1,000 were deemed to belong to the non-reactive peptides and correspond to the last 10,218 rows in the training set. This binarization of the data into a reactive positive set and a non-reactive negative set is made for clarity in the scoring of the submissions, but is otherwise arbitrary. In total, the training set contains 13,638 rows, of which the first 3,420 rows constitute the pool of positive peptides and the last 10,218 rows constitute the pool of negative peptide sequences.
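The binarization described above can be sketched as a small parser for the two-column training table. This is a hedged illustration, not the organizers' code: the function name load_training and the constants POSITIVE_MIN and NEGATIVE_MAX are assumptions, as is treating the thresholds 10,000 and 1,000 as inclusive. The sample rows are taken from the excerpt of the training set shown earlier.

```python
import io

POSITIVE_MIN = 10_000  # assumed inclusive lower bound of the positive set
NEGATIVE_MAX = 1_000   # assumed inclusive upper bound of the negative set


def load_training(handle):
    """Parse 'peptide<TAB>signal' lines into (peptide, signal, label) tuples.

    label is 1 for the positive (reactive) set and 0 for the negative set.
    """
    rows = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        peptide, raw_signal = line.split("\t")
        signal = int(raw_signal)
        if signal >= POSITIVE_MIN:
            label = 1
        elif signal <= NEGATIVE_MAX:
            label = 0
        else:
            # The training set is described as containing no such rows.
            raise ValueError(f"signal {signal} falls between the two sets")
        rows.append((peptide, signal, label))
    return rows


sample = (
    "HWNNIRMSYMHAHTF\t65423\n"
    "GLLLLTLSVLLAAGP\t10007\n"
    "SAKDSEHNEKYEDTF\t941\n"
    "QVATHVKINVQMHLG\t1\n"
)
rows = load_training(io.StringIO(sample))
```

With the real file, the same function could be passed an open file handle instead of the in-memory sample.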
The file DREAM5_EAR_TestSet.txt contains an alphabetically ordered list of 13,640 peptide sequences that constitute the test set. Most of these sequences are 15 amino acids long, but there are also some other sequence lengths (such as several of 13 and a few of 9 and 20 amino acids). The challenge participants are required to predict whether each of these peptides belongs to the positive or negative set with respect to its reactivity with human IVIg antibodies. Among the 13,640 peptides in the test set, 3,421 had reactivity values between 10,000 and 65,536 and were classified as positive, and 10,219 had reactivity values between 1 and 1,000 and were classified as negative. The first and last peptides as given in the test set file are represented below.
AAAAAAAAAAAAAVA
AAAAAAAAAAAAVAA
AAAAAAAAAAAVAAA
AAAAAAAAAVAAAPP
AAAAAAARRQEQTLR
AAAAACLSRQASSDS
...
YYGSGTPSSFPTVSL
YYHLSQYYDNVSIDY
YYIDGKIQTNNNTSN
YYKYILTRNFEALNA
YYNHAIDWQTGPGCN
YYRIIIPVLLMLVFL
YYYSISFSKIDGQQR
Participants are required to submit a ranked list of the peptides in the test set, ordered by the confidence assigned to each peptide being in the positive set, from the most reliable (first row) to the least reliable (last row) prediction. Use a two-column, tab-separated format, as in:
S \tab XYZ
where S is one of the peptide sequences listed in the file DREAM5_EAR_TestSet.txt and XYZ is a score between 0 and 1 that indicates the confidence level you assign to the prediction. For example, XYZ = 1 if peptide S is deemed to be in the positive set with the highest confidence, and XYZ = 0 if S is deemed to be in the negative set with the highest confidence. All peptides from DREAM5_EAR_TestSet.txt that are omitted will be considered to appear randomly ordered at the end of the list. Save your submission as a text file, and name it:
- DREAM5_EAR_TeamName_Predictions.txt
where "TeamName" is the name of the team with which you registered for the challenge. Best performance will be assessed based on the accuracy of the results of this prediction.
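The submission format for the main challenge can be produced with a few lines of code. The following is a minimal sketch under stated assumptions: the function name format_predictions, the six-decimal score formatting, and the example scores are all hypothetical (the scores are made-up placeholders, not real predictions). The peptides come from the test set excerpt.

```python
def format_predictions(scores):
    """Return submission text: one 'peptide<TAB>score' line per peptide,
    ordered from the most to the least confident positive-set prediction.
    """
    for peptide, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {peptide} is outside [0, 1]: {score}")
    # sorted() is stable, so peptides with equal scores keep their input order.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return "".join(f"{peptide}\t{score:.6f}\n" for peptide, score in ranked)


# Placeholder confidence values, for illustration only.
example_scores = {
    "AAAAAAAAAAAAAVA": 0.12,
    "YYYSISFSKIDGQQR": 0.87,
    "AAAAAAARRQEQTLR": 0.45,
}
submission_text = format_predictions(example_scores)
print(submission_text, end="")
```

The resulting text would then be saved under the required file name, with "TeamName" replaced by the registered team name.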
Bonus Round (optional)
(The predictions of submission to the Main Challenge will be evaluated whether or not the submitting team participates in the Bonus Round. However, participation in the bonus round is strongly encouraged, as it will provide a layer of experimental validation of a team’s method.)
We invite participants to tackle the de-novo prediction of peptides and their reactivity to antibodies. Peptide predictions for the bonus round should be submitted at the same time as the predictions for the main challenge. The submitted file should be a tab-separated two-column table. The first column contains the peptide sequences whose reactivity category is predicted. The second column contains the predicted reactivity category, and has to be one of the three letters H, M, or L, for High, Middle and Low reactivity. In other words, each row should look like
S \tab XYZ
where S is a peptide sequence of length 15 and XYZ can be any of the letters H, M and L. Save the file as text, and name it:
- DREAM5_EAR_TeamName_BonusRound.txt
Replace "TeamName" in the filename with the name of your team before submitting.
The following restrictions on your de-novo predicted sequences are made to discourage a conservative strategy for generating novel peptides in this bonus round:
- 1. All 1,000 or more submitted peptide sequences in the H class shall not have stretches of more than three amino acids in common with any of the sequences in the positive training set. If 4 consecutive amino acids in the bonus round class H submission are identical to any tetramer in the positive training set, then this sequence will not be considered.
- 2. If the identity between a peptide sequence predicted in class H and any of the positive training set sequences is 6 or more amino acids within a stretch of 11 amino acids then the predicted sequence will not be considered.
- 3. In like manner, all 1,000 or more submitted peptide sequences in the L class shall not have stretches of more than three amino acids in common with any of the sequences in the negative training set. If 4 consecutive amino acids in the bonus round class L submission are identical to any tetramer in the negative training set, then this sequence will not be considered.
- 4. If the identity between a peptide sequence predicted in class L and any of the negative training set sequences is 6 or more amino acids within a stretch of 11 amino acids then the predicted sequence will not be considered.
- 5. If fewer than 1,000 sequences remain in the bonus round class H or L submissions, the bonus round submission will not be accepted.
- 6. Class M submissions in the bonus round are not subjected to any of these constraints.
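Restrictions 1 to 4 above can be checked before submission with a sketch like the following. It rests on one interpretation that the text does not spell out: "identity within a stretch of 11 amino acids" is taken here to mean the number of position-wise matches between any ungapped 11-residue window of the candidate and any 11-residue window of a reference peptide. The function names kmers, max_window_identity and passes_restrictions are hypothetical, and the reference list holds a single training peptide purely for illustration.

```python
def kmers(sequence, k=4):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def max_window_identity(a, b, window=11):
    """Largest number of position-wise matches between any ungapped
    window of `a` and any window of `b` (assumed reading of the rule).
    Returns 0 if either sequence is shorter than the window.
    """
    best = 0
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            matches = sum(x == y for x, y in zip(a[i:i + window], b[j:j + window]))
            best = max(best, matches)
    return best


def passes_restrictions(candidate, reference_peptides, reference_tetramers):
    """True if the candidate shares no tetramer with the reference set and
    has fewer than 6 identical residues in every compared 11-residue window.
    """
    if not kmers(candidate).isdisjoint(reference_tetramers):
        return False
    return all(max_window_identity(candidate, ref) < 6 for ref in reference_peptides)


# For class H the reference would be the positive training set,
# for class L the negative training set.
reference = ["HWNNIRMSYMHAHTF"]
reference_tetramers = set().union(*(kmers(p) for p in reference))

print(passes_restrictions("GGGGGGGGGGGGGGG", reference, reference_tetramers))
print(passes_restrictions("AAAAHWNNAAAAAAA", reference, reference_tetramers))
```

The all-against-all window comparison is quadratic in the number of peptides, so for the full training set a faster index would likely be needed. The sketch only shows the logic of the two checks.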
Finally, we request that each participating team submit a short write-up (around two to three pages) explaining the methods used to arrive at their predictions for the Main Challenge, and for the Bonus Round if applicable. This write-up can contain pseudo-code describing the algorithm used, the rule sets describing subsets of peptide/antibody-specific recognitions, workflows for analysing peptide-antibody interactions, etc. Submit the write-up as the file
- DREAM5_EAR_TeamName_Writeup.ext
replacing "TeamName" with the name of your team and the file extension (ext) with your choice of txt, doc, rtf, or pdf. The submission of this write-up is mandatory for participation in the main challenge.
Results will be scored using the area under the precision versus recall (PR) curve. For the first k peptides (ranked by the score contained in the second column of the file DREAM5_EAR_TeamName_Predictions.txt, with peptides that have the same score taken in the order in which they appear in the prediction file), precision is defined as the number of correct positive set predictions divided by k, and recall is the proportion of correct positive set predictions out of all peptides in the positive set. Other metrics such as the area under the receiver operating characteristic (ROC) curve will also be evaluated. Teams will be ranked according to their overall performance based on the area under the PR and ROC curves. We will evaluate these predictions as discussed in Stolovitzky et al., 2009. The predictions pertaining to the Bonus Round (the de-novo prediction of peptides with specified reactivity to antibodies) will be experimentally validated. This will be done only for predictions submitted by teams that achieved top performance in the main challenge. This validation will be scored using a measure of accuracy in the prediction of each of the requested categories.
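The precision and recall definitions above can be illustrated with a short sketch. It is not the organizers' scoring code: the function names pr_curve and area_under_pr are hypothetical, the area is computed with a plain trapezoidal rule (the exact interpolation used for scoring is not specified here), and peptides omitted from a submission are not handled. The identifiers P1, P2, N1 and N2 are placeholder peptide names.

```python
def pr_curve(submission, positives):
    """Return (recall, precision) after each of the first k ranked peptides.

    `submission` is a list of (peptide, score) pairs in submitted order.
    The sort is stable, so tied scores keep their submitted order.
    """
    ranked = sorted(submission, key=lambda row: row[1], reverse=True)
    hits = 0
    points = []
    for k, (peptide, _score) in enumerate(ranked, start=1):
        if peptide in positives:
            hits += 1
        points.append((hits / len(positives), hits / k))
    return points


def area_under_pr(points):
    """Trapezoidal area under the PR curve, starting from recall 0."""
    area = 0.0
    prev_recall, prev_precision = 0.0, points[0][1]
    for recall, precision in points:
        area += (recall - prev_recall) * (precision + prev_precision) / 2
        prev_recall, prev_precision = recall, precision
    return area


# Placeholder example: two positives (P1, P2) and two negatives (N1, N2).
submission = [("P1", 0.9), ("N1", 0.9), ("P2", 0.4), ("N2", 0.1)]
points = pr_curve(submission, positives={"P1", "P2"})
print(area_under_pr(points))
```

In this example P1 and N1 are tied at 0.9, and P1 is ranked first only because it was submitted first, which is how the tie rule in the text affects the curve.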
 Lorenz P, Kreutzer M, Zerweck J, Schutkowski M, Thiesen HJ. Probing the epitope signatures of IgG antibodies in human serum from patients with autoimmune disease. Methods Mol Biol. 524:247-58 (2009)
 Stolovitzky G, Prill RJ, Califano A. "Lessons from the DREAM2 Challenges", in Stolovitzky G, Kahlem P, Califano A, Eds, Annals of the New York Academy of Sciences, 1158:159-95 (2009)
The challenge has been provided by Hans-Juergen Thiesen, from the Institute of Immunology, University of Rostock, Germany. Pre-publication data was generously provided by Peter Lorenz, Michael Hecker, and Felix Steinbeck, representing the research group of Hans-Juergen Thiesen, University of Rostock, Germany. The challenge has been designed in collaboration with Robert Prill and Gustavo Stolovitzky from the IBM T.J. Watson Research Center in New York and Julio Saez-Rodriguez from Harvard and MIT.
Don't hesitate to post a question in the DREAM Discussion board if you need any clarification or have a suggestion about this challenge.