Systems Genetics challenges
DREAM5, Challenge 3
| Note. This Systems Genetics Challenge is composed of two major subchallenges: DREAM5 SYSGEN A, based on in-silico data and designed to elucidate causal network models among genes, and DREAM5 SYSGEN B, based on experimental data on soybean and designed to predict complex phenotypes from a combination of genetics and expression data. |
Introduction
The central goal of systems biology is to gain a predictive, system-level understanding of biological networks. This can be done, for example, by inferring causal networks from observations on a perturbed biological system. An ideal experimental design for causal inference is randomized, multifactorial perturbation. The recognition that the genetic variation in a segregating population represents randomized, multifactorial perturbations (Jansen and Nap (2001), Jansen (2003)) gave rise to Systems Genetics (SG), where a segregating or genetically randomized population is genotyped for many DNA variants, and profiled for phenotypes of interest (e.g. disease phenotypes), gene expression, and potentially other ‘omics’ variables (protein expression, metabolomics, DNA methylation, etc.; Figure 1. Figure 1 was taken from Jansen and Nap (2001)). In this challenge we explore the use of Systems Genetics data for elucidating causal network models among genes, i.e. Gene Networks (DREAM5 SYSGEN A) and predicting complex disease phenotypes (DREAM5 SYSGEN B).
DREAM5 SYSGEN A – In-silico network challenge
The goal of “DREAM5 SYSGEN A – In-silico network challenge” is to create models with biological interpretation, i.e. to reverse-engineer Gene Networks (GNs), from Systems Genetics data.
From previous DREAM challenges, especially the DREAM3 [Prill et al. (2010), Marbach et al. (2010)] and DREAM4 In-silico Network Challenges, it has become clear that systematic perturbations (e.g. experimental gene knockouts and knockdowns) and measurements of the responses contribute greatly to establishing the directed structure of GNs. However, large-scale systematic knockouts may be unrealistic or infeasible for many cell types and even impossible for some organisms. Systems Genetics experiments, as considered here, could provide an alternative. Genetic polymorphisms, which are naturally present in populations, act as multifactorial genetic perturbations that could be used to elucidate causal links between genes. For example, if the mean expression level of gene B differs significantly between two groups of individuals, one carrying one genetic variant of gene A and the other carrying another variant of gene A, this observation is highly indicative of a causal regulatory effect of gene A on gene B. Recently, multiple approaches have been applied to Systems Genetics data in order to elucidate GNs. For an introduction to these methods, please see Rockman (2008).
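The group comparison described above can be illustrated with a small sketch. It assumes a Welch t statistic as the measure of difference (the text does not prescribe a particular test), and the function name `welch_t` and the toy values are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(expr_b, geno_a):
    """Welch t statistic for the expression of gene B, split by the 0/1 genotype of gene A."""
    group0 = [x for x, g in zip(expr_b, geno_a) if g == 0]
    group1 = [x for x, g in zip(expr_b, geno_a) if g == 1]
    if len(group0) < 2 or len(group1) < 2:
        raise ValueError("each genotype group needs at least two individuals")
    # Standard error of the difference in means, allowing unequal variances.
    se = sqrt(variance(group0) / len(group0) + variance(group1) / len(group1))
    if se == 0:
        raise ValueError("expression is constant within both groups")
    return (mean(group1) - mean(group0)) / se


# Toy data (hypothetical): eight individuals, genotype of gene A and expression of gene B.
geno_a = [0, 0, 0, 0, 1, 1, 1, 1]
expr_b = [1.0, 1.2, 0.9, 1.1, 2.0, 2.1, 1.9, 2.2]
t_stat = welch_t(expr_b, geno_a)
```

A large absolute value of `t_stat` indicates that the two genotype groups differ in mean expression; turning it into a p-value requires a t distribution, which a statistics library can supply.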
The Generative Model
Due to the lack of a reliable experimentally determined Gold Standard network, this challenge is based on simulated Systems Genetics data (Liu et al (2008)). Systems Genetics data was simulated with our MATLAB software tool SysGenSIM (Pinna, Soranzo, Hoeschele and de la Fuente, unpublished).
In these simulations we consider data from Recombinant Inbred Lines (RILs), i.e. a set of homozygous lines derived from a cross between two genetically diverse inbred parent lines, through inbreeding for multiple generations. Each of these RILs is homozygous for the allele of one of the parents, and each RIL has inherited different combinations of parental alleles: the RILs constitute a genetically randomized population. In other words, the gene expression pattern of each RIL is the result of a different multifactorial genetic perturbation.
We generated genotyping and gene expression data for “in-silico” RILs populations in the following way:
- Networks of 1000 genes with ‘modular scale-free topology’ were generated, and the dynamical model (Figure 2) was defined according to each network structure.
- For each of the networks, we generated the genotypes of a population of N RILs. Each RIL is represented as a vector of binary genotype values (0/1), one for each of 1000 homozygous genes. We propose three subchallenges, each with 5 different networks, and each network with different RIL populations of size: N = 100 (subchallenge A1), N = 300 (subchallenge A2) and N = 999 (subchallenge A3).
- For all networks, 20 chromosomes with 50 genes each were considered. The 1/0 values in the genotype vectors for each RIL were sampled with correlations between adjacent positions on the chromosomes (mimicking ‘genetic linkage’). No relationship between network positions of genes and their locations on chromosomes was assumed.
- Each gene was assumed to have a single (functional) genetic variant, either in the gene’s promoter region (leading to a ‘cis effect’ on its expression rate) with probability 0.25 or in the gene’s coding region (leading to ‘trans effects’ on its targets) with probability 0.75.
- Steady state gene-expression levels for all RILs were calculated after adjusting the Z parameters according to the corresponding genotype vector, and setting the values for θgtrsc and θgdeg.
- Simulations were done using the deterministic ordinary differential equations (ODEs) of Fig. 2. Simulated experimental noise is added to the steady state values.
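The genotype step of the procedure above can be sketched as follows. This is only an illustration of sampling 0/1 values with correlation between adjacent positions; it is not the procedure used by SysGenSIM. The parameter `switch_prob` (the probability that the parental origin changes between neighbouring genes) and its value are assumptions.

```python
import random


def sample_ril_genotype(n_chromosomes=20, genes_per_chromosome=50,
                        switch_prob=0.1, rng=random):
    """Sample one RIL genotype vector of 0/1 values with linkage along each chromosome."""
    genotype = []
    for _ in range(n_chromosomes):
        # Chromosomes are treated as independent: draw a fresh starting allele.
        allele = rng.randint(0, 1)
        for position in range(genes_per_chromosome):
            # Within a chromosome, the allele switches only occasionally,
            # so adjacent genes tend to share the same parental origin.
            if position > 0 and rng.random() < switch_prob:
                allele = 1 - allele
            genotype.append(allele)
    return genotype


rng = random.Random(0)  # fixed seed so the example is reproducible
population = [sample_ril_genotype(rng=rng) for _ in range(100)]
```

With the default arguments each vector has 20 × 50 = 1000 entries, matching the 1000 genes per network, and `population` corresponds to a population of N = 100 RILs.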
The Data
| Important Note. Please let the organizers know if you plan to use the data of subchallenge A in your own publications outside of the DREAM challenge context. |
In each of the three subchallenges that constitute the DREAM5 SysGenA challenge, we provide data for networks with 1000 genes as tab-separated values (TSV) files, as indicated below:
- Subchallenge A1: In this subchallenge we provide a population of 100 RILs. For each of 5 networks, we provide 2 data files:
- Each of the files DREAM5_SysGenA100_Networki_Expression.tsv contains a matrix of dimension 100×1000 whose entries are continuous gene-expression values corresponding to network i, where i ∈ {1,2,3,4,5}.
- Each of the files DREAM5_SysGenA100_Networki_Genotype.tsv contains a matrix of dimension 100×1000 whose entries are binary genotype values (0/1) for network i, where i ∈ {1,2,3,4,5}.
- Subchallenge A2: In this subchallenge we provide a population of 300 RILs. For each of 5 networks (different from those of subchallenge A1), we provide 2 data files:
- Each of the files DREAM5_SysGenA300_Networki_Expression.tsv contains a matrix of dimension 300×1000 whose entries are continuous gene-expression values corresponding to network i, where i ∈ {1,2,3,4,5}.
- Each of the files DREAM5_SysGenA300_Networki_Genotype.tsv contains a matrix of dimension 300×1000 whose entries are binary genotype values (0/1) for network i, where i ∈ {1,2,3,4,5}.
- Subchallenge A3: In this subchallenge we provide a population of 999 RILs. For each of 5 networks (different from those of subchallenge A1 and A2), we provide 2 data files:
- Each of the files DREAM5_SysGenA999_Networki_Expression.tsv contains a matrix of dimension 999×1000 whose entries are continuous gene-expression values corresponding to network i, where i ∈ {1,2,3,4,5}.
- Each of the files DREAM5_SysGenA999_Networki_Genotype.tsv contains a matrix of dimension 999×1000 whose entries are binary genotype values (0/1) for network i, where i ∈ {1,2,3,4,5}.
| Note. The first row of these files is a header row ("G1" "G2" ... "G1000") that indicates which gene each column refers to. Thus, the j-th column contains the expression levels or the genetic variant of the same gene "Gj". The rows are ordered such that the i-th row after the header row in both the Expression and Genotype files corresponds to data from the same RIL i. |
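A minimal sketch of reading one of these files with the Python standard library is shown below. It assumes that the header entries may be wrapped in double quotes and that every data row has one value per gene; the function name `read_matrix` is hypothetical.

```python
import csv


def read_matrix(path, numeric=float):
    """Read a tab-separated matrix with a header row of gene names.

    Returns (gene_names, rows), where rows[i][j] is the value for RIL i and gene j.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        gene_names = next(reader)  # header row: G1, G2, ..., G1000
        rows = []
        for row in reader:
            if not row:
                continue  # skip blank lines
            if len(row) != len(gene_names):
                raise ValueError("row length does not match the header")
            rows.append([numeric(value) for value in row])
    return gene_names, rows


# Example usage (file names follow the pattern given above, with i = 1):
# genes, expression = read_matrix("DREAM5_SysGenA100_Network1_Expression.tsv")
# _, genotype = read_matrix("DREAM5_SysGenA100_Network1_Genotype.tsv", numeric=int)
# expression[i] and genotype[i] then describe the same RIL i.
```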
| Update July 06 2010 : A participant discovered a mistake in the simulated genotyping data. We have solved the problem and re-generated all data. If you have downloaded the data before July 06 2010, please download the new data sets and discard the earlier downloaded data sets. |
What do we want to learn from this challenge? : This challenge is aimed at identifying the best approaches for Gene Network inference from Systems Genetics data for varying sample sizes, in particular considering the N << p problem where the number of observations N (number of RILs) is less than the number of variables p (number of genes).
Submission
In order for this challenge to shed light on the performance of the algorithms under different data sizes, participants are strongly encouraged to submit predictions for all three subchallenges. However, predictions for only one or two of the three subchallenges will be accepted.
Participants are required to submit:
1. For each subchallenge submit five files, each containing a ranked list of no more than 100,000 regulatory link predictions ordered according to the confidence you assign to the predictions, from the most reliable (first row) to the least reliable (last row) prediction. Use a 3 tab-separated column format as in the example below:
A \tab B \tab XYZ
where A and B are two different genes (G1, G2,…,G1000). No self-interactions (A \tab A \tab XYZ) will be considered. Links are directed: the gene in the first column regulates the gene in the second column. (If A regulates B and B also regulates A, then both lines should be included separately.) Links are unsigned (positive and negative regulation are both evaluated simply as regulation). XYZ is a score between 0 and 1 that indicates the confidence level you assign to the prediction (e.g., XYZ = 1 if gene A is deemed to regulate gene B with highest confidence and XYZ = 0 if A is deemed not to directly regulate B). All pairs omitted from the list will be considered to appear randomly ordered at the end of the list. Save the file as text, and name it:
- DREAM5_TeamName_SubChallenge_Networki.txt
where "TeamName" is the name of the team with which you registered for the challenge, "SubChallenge" is either SysGenA100, SysGenA300, or SysGenA999, and "Networki" is one of the five networks of the indicated subchallenge (Network1, Network2,..., Network5). To participate in a subchallenge, you need to submit predictions for all five networks.
2. Submit a short (one to two page) write-up explaining the methodology used to generate your predictions. Submit the write-up as the file
- DREAM5_TeamName_SysGenA_Writeup.ext
replacing "TeamName" with the name of your team and the file extension ("ext") with your choice of txt, doc, rtf, or pdf.
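The prediction file format described in item 1 can be produced with a few lines of code. The sketch below assumes the predictions are held in a dictionary mapping (regulator, target) pairs to confidence scores; the function names and the team name "MyTeam" are hypothetical.

```python
def format_predictions(scores, max_links=100_000):
    """Return submission lines for a dict {(regulator, target): confidence}."""
    kept = []
    for (regulator, target), confidence in scores.items():
        if regulator == target:
            continue  # self-interactions are not considered
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence out of range for {regulator} -> {target}")
        kept.append((regulator, target, confidence))
    # Most reliable prediction first; the sort is stable, so ties keep input order.
    kept.sort(key=lambda link: link[2], reverse=True)
    return [f"{a}\t{b}\t{c:.6g}" for a, b, c in kept[:max_links]]


def submission_filename(team, subchallenge, network):
    """Build the file name for one network, e.g. network=1 for Network1."""
    return f"DREAM5_{team}_{subchallenge}_Network{network}.txt"


# Toy scores (hypothetical); the self-interaction G3 -> G3 is dropped.
toy_scores = {("G1", "G2"): 0.9, ("G2", "G1"): 0.4,
              ("G3", "G3"): 1.0, ("G4", "G1"): 0.7}
lines = format_predictions(toy_scores)
name = submission_filename("MyTeam", "SysGenA100", 1)
```

Writing `"\n".join(lines)` to a text file called `name` gives a file in the three-column layout, with directed links listed from highest to lowest confidence.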
Scoring metrics
We will score the results using the area under the Precision versus Recall (PR) curve for the whole set of link predictions for a network. For the first k predictions (ranked by score, and for predictions with the same score, taken in the order they were submitted in the prediction files), precision is defined as the fraction of those k predictions that are correct, and recall is the proportion of correct predictions out of all the possible true connections. The area under the receiver operating characteristic (ROC; http://en.wikipedia.org/wiki/Roc_curve) curve will also be evaluated. For each subchallenge an overall score will be obtained as in previous DREAM In-silico Network challenges. The precise scoring system can be found in Stolovitzky et al. (2009). Teams will be ranked according to their overall performance over the five networks of a challenge.
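The definitions of precision and recall at the first k predictions can be made concrete with a short sketch. It covers only the submitted list and uses a simple trapezoidal area; the official scoring, including the treatment of omitted pairs, is the one described in Stolovitzky et al. (2009). The function names and the toy network are assumptions for illustration.

```python
def precision_recall_points(ranked_links, true_links):
    """Return (recall, precision) after each of the first k predictions, k = 1, 2, ..."""
    points = []
    correct = 0
    for k, link in enumerate(ranked_links, start=1):
        if link in true_links:
            correct += 1
        # recall: correct predictions out of all true links
        # precision: correct predictions out of the first k predictions
        points.append((correct / len(true_links), correct / k))
    return points


def area_under_pr(points):
    """Trapezoidal area under the (recall, precision) points, starting at recall 0."""
    area = 0.0
    prev_recall, prev_precision = 0.0, points[0][1]
    for recall, precision in points:
        area += (recall - prev_recall) * (precision + prev_precision) / 2
        prev_recall, prev_precision = recall, precision
    return area


# Toy example (hypothetical): four ranked directed links, two of which are true.
ranked = [("G1", "G2"), ("G2", "G3"), ("G3", "G1"), ("G1", "G3")]
truth = {("G1", "G2"), ("G3", "G1")}
points = precision_recall_points(ranked, truth)
aupr = area_under_pr(points)
```

In this toy case the precision is 1 after the first prediction, 1/2 after the second, and 2/3 after the third, at which point both true links have been recovered and recall reaches 1.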
DREAM5 SYSGEN B – The Systems Genetics of soybean data challenge
The ability to predict complex phenotypes (such as disease susceptibility) from genotyping and/or gene expression is one of the keys that will open the door to personalized medicine. Both types of data are currently collected at a tremendous rate using established technologies such as DNA arrays or with emerging ones such as DNA and RNA next-generation sequencing. The goal of this challenge (see Figure 3) is to obtain models that predict disease phenotypes from:
- (i) only genotype data (subchallenge B1),
- (ii) only gene-expression data (subchallenge B2), or
- (iii) both genotype and gene-expression data (subchallenge B3).
Participants are referred to Chen, B.J., et al. (2009) for an idea of how to tackle such problems.
The Data
| Important Note. Part of the data of the subchallenge SysGenB was generously provided prior to publication. The datasets of this challenge may not be used for publication without explicit permission of the data owners. Please contact Gustavo Stolovitzky to coordinate with the owners if you plan to use the data for a publication (gustavo@us.ibm.com). Once the owners have published the data, all datasets may be freely used. The reference to cite will be posted here. |
The data comes from a Systems Genetics experiment in soybean conducted at the Virginia Bioinformatics Institute (VBI) (data kindly provided by Brett Tyler and colleagues at VBI, Virginia Tech and Ohio State University). In this study two inbred parental lines, differing substantially in susceptibility to a major pathogen, were crossed and their offspring were selfed (inbred) for more than 12 generations to produce a population of Recombinant Inbred Lines (RILs). There is nearly no genetic variation within each RIL but much variation among RILs, because each RIL represents a different combination of the two parental genomes. Each RIL was genotyped for 941 genetic variants, and gene-expression profiled for 28,395 genes. Note that the gene expression data was measured in uninfected plants, so as to determine whether disease resistance can be predicted from ‘normal’ gene expression.
Then the plants were infected with a pathogen (Phytophthora sojae) and assayed for two continuous phenotypes related to the severity of infection: ‘percent present’ and ‘scale factor’. Both of these phenotypes are measures of the amount of pathogen RNA in the infected tissue sample and thus are measures of the density of pathogen colonization of the infected tissue. Percent present is the fraction of pathogen probe sets that yield a detectable hybridization signal, as determined by the MAS5 presence/absence call in the Affymetrix software used to analyze the data. Scale factor is the ratio of the sum of all the background-subtracted soybean probe intensities to the sum of all the background-subtracted P. sojae probe intensities. Plants that are resistant to the pathogen generally have low values for these phenotypes, while susceptible plants have high values.
The training set consists of data from 200 RILs for which genotype, gene-expression and phenotype information are provided. Three training data files are provided in the file DREAM5_SysGenB_TrainingData.zip, downloadable from the DREAM web site:
- DREAM5_SysGenB_TrainingGenotypeData.txt: contains a data matrix Genotype(941×200) with genotype values (0/1).
- DREAM5_SysGenB_TrainingExpressionData.txt: contains a data matrix Expression(28,395×200) with continuous gene-expression values.
- DREAM5_SysGenB_TrainingPhenotypeData.txt: contains data matrix Phenotype(2×200) with continuous phenotype values.
| Update July 06 2010 : By request of one of the challenge participants we now also provide files with the IDs of gene expression probes (DREAM5_SysGenB_ExpressionProbeIDs.xls) and genetic markers (DREAM5_SysGenB_GenotypeMarkerIDs.xls). |
The Challenge
This challenge is aimed at identifying the best predictive modeling approaches as well as evaluating the benefits of learning from combined genotype and gene-expression data. This challenge is composed of three subchallenges. Participants are encouraged to submit predictions to all of the following three subchallenges. Predictions for individual subchallenges will also be accepted.
- Subchallenge B1: The challenge is to predict the two phenotypes from the genotype data only. The test data set consists of 30 RILs for which the genotype of the 941 markers is provided, but the corresponding values for the two phenotypes are withheld. The file DREAM5_SysGenB1_TestGenotypeData.txt contains a data matrix Genotype(941×30) with genotype binary values (0/1) from which the phenotypes are to be predicted.
- Subchallenge B2: The challenge is to predict the two macroscopic phenotypes from the gene-expression levels. The test data set consists of 30 RILs (different from those of subchallenge B1) for which the gene-expression data of the 28,395 genes are provided. The file DREAM5_SysGenB2_TestExpressionData.txt contains a data matrix Expression(28,395×30) with gene-expression values from which the phenotypes are to be predicted.
- Subchallenge B3: The challenge is to predict the two macroscopic phenotypes from genotype and expression data. The test data set consists of 30 RILs (different from those of subchallenges B1 and B2) for which the genotypes of the 941 markers and the gene-expression data of the 28,395 genes are provided. The files DREAM5_SysGenB3_TestGenotypeData.txt and DREAM5_SysGenB3_TestExpressionData.txt, containing respectively the genotype matrix Genotype(941×30) and the gene-expression matrix Expression(28,395×30), are provided, from which the phenotypes are to be predicted.
The file DREAM5_SysGenB_TestData.zip, which contains the files DREAM5_SysGenB1_TestGenotypeData.txt, DREAM5_SysGenB2_TestExpressionData.txt, DREAM5_SysGenB3_TestGenotypeData.txt and DREAM5_SysGenB3_TestExpressionData.txt, can be downloaded from the DREAM web site.
Submission
Participants are required to submit the values of the phenotype traits and a write-up (mandatory) describing their methods. For the submission of the phenotype predictions, we provide a template file, available as DREAM5_SysGenB_Predictions.xls and DREAM5_SysGenB_Predictions.csv, which has the following structure:
| | RIL1 | RIL2 | ... | RIL30 |
|---|---|---|---|---|
| Phenotype1 | PREDICT | PREDICT | ... | PREDICT |
| Phenotype2 | PREDICT | PREDICT | ... | PREDICT |
Please submit one file per subchallenge, replacing in the template file "PREDICT" by the predicted phenotype values for each RIL. At submission rename the file to
DREAM5_TeamName_SubChallenge_Predictions.xls
DREAM5_TeamName_SubChallenge_Predictions.csv
where "TeamName" is the name of the team with which you registered for the challenge and "SubChallenge" is either "SysGenB1", "SysGenB2" or "SysGenB3".
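Filling in the template can be automated. The sketch below writes the predictions in CSV form, assuming the CSV template mirrors the table shown above (an empty first header cell, one column per RIL, one row per phenotype); the function name `predictions_csv` is hypothetical.

```python
import csv
import io


def predictions_csv(phenotype1, phenotype2):
    """Return CSV text with one column per RIL and one row per phenotype."""
    if len(phenotype1) != len(phenotype2):
        raise ValueError("both phenotypes need one prediction per RIL")
    n_rils = len(phenotype1)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Header row: empty corner cell followed by RIL1, RIL2, ...
    writer.writerow([""] + [f"RIL{j}" for j in range(1, n_rils + 1)])
    writer.writerow(["Phenotype1"] + list(phenotype1))
    writer.writerow(["Phenotype2"] + list(phenotype2))
    return buffer.getvalue()


# Toy example with two RILs (a real submission has 30 predictions per phenotype).
text = predictions_csv([0.1, 0.2], [1.5, 2.5])
```

The returned text can then be saved under the file name required for the chosen subchallenge.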
Write-up. We request that each participating team submits a short write-up (around two to three pages) explaining the methods used to arrive at their predictions of the phenotypes. This write-up, which is mandatory for submission, can contain pseudo-code, workflows, and explanations of the concepts. Submit the write-up as the file
DREAM5_TeamName_SysGenB_Writeup.ext
replacing "TeamName" with the name of your team and the file extension (ext) with your choice of txt, doc, rtf, or pdf. The submission of this write-up is mandatory for participation in this challenge.
Scoring metrics
Each subchallenge will be scored individually. A p-value for the rank correlation between predicted phenotype values and the actual withheld phenotype values will be calculated. Overall scores for each subchallenge are obtained by combining the p-values for the two phenotypes in a single score:
where p-valuei refers to the p-value for phenotype i predictions. Teams will be ranked based on this score.
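The rank correlation between predicted and withheld phenotype values can be computed as in the sketch below. It assumes Spearman's rank correlation with tied values given their average rank; the text does not state which rank correlation is used. It returns only the correlation coefficient. The corresponding p-value can be obtained from a statistics library, for example `scipy.stats.spearmanr`.

```python
from math import sqrt


def average_ranks(values):
    """1-based ranks of the values, with ties sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman_rho(predicted, actual):
    """Spearman rank correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(predicted), average_ranks(actual)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy values (hypothetical): predictions that order the RILs exactly as the truth does.
rho = spearman_rho([0.2, 0.5, 0.9, 1.4], [10.0, 20.0, 30.0, 40.0])
```

Because only the ordering matters, predictions on a different scale from the measured phenotype can still reach a correlation of 1 if they rank the RILs correctly.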
References/Further reading
Jansen, R.C., and Nap, J.P. (2001) Genetical genomics: the added value from segregation. Trends Genet. 17, 388-391 (source of Figure 1)
Jansen, R.C. (2003) Studying complex biological systems using multifactorial perturbation. Nat. Rev. Gen. 4, 145-151
Liu B, de la Fuente A and Hoeschele I. (2008) Gene network inference via structural equation modeling in genetical genomics experiments. Genetics 178, 1763-1776
Prill RJ, Marbach D, Saez-Rodriguez J, Sorger PK, Alexopoulos LG, Xue X, Clarke ND, Altan-Bonnet G and Stolovitzky G. (2010) Towards a rigorous assessment of systems biology models: the DREAM3 challenges. PLoS One 5(2), e9202
Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D and Stolovitzky G. (2010) Revealing strengths and weaknesses of methods for gene network inference. Proc Natl Acad Sci U S A 107(14), 6286-6291
Rockman, M.V. (2008) Reverse engineering the genotype-phenotype map with natural genetic variation. Nature 456, 738-744
Stolovitzky G, Prill RJ and Califano A. (2009) Lessons from the DREAM2 Challenges, in Stolovitzky G, Kahlem P, Califano A, Eds, Annals of the New York Academy of Sciences, 1158:159-95
Chen BJ, Causton HC, Mancenido D, Goddard NL, Perlstein EO, and Pe'er D. (2009) Harnessing gene expression to identify the genetic basis of drug resistance. Mol Syst Biol 5, 310
Authors
The challenge has been provided by Alberto de la Fuente, Andrea Pinna and Nicola Soranzo from CRS4 Bioinformatica in Sardinia, Italy, and Ina Hoeschele and Brett Tyler from the Virginia Bioinformatics Institute, VA USA. The challenge has been designed in collaboration with Robert Prill and Gustavo Stolovitzky from the IBM T.J. Watson Research Center in New York and Julio Saez-Rodriguez from Harvard and MIT.
Download
Download Data (Registration Required).
Questions and Feedback
Don't hesitate to post a question on the DREAM Discussion board or directly contact Alberto de la Fuente (alf@crs4.it) or Gustavo Stolovitzky (gustavo@us.ibm.com) if you need any clarification or have a suggestion about this challenge.
