Systems Genetics challenges

DREAM5, Challenge 3

Note. This Systems Genetics Challenge is composed of two major subchallenges: DREAM5 SYSGEN A, based on in-silico data and designed to elucidate causal network models among genes, and DREAM5 SYSGEN B, based on experimental soybean data and designed to predict complex phenotypes from a combination of genetic and expression data.

Introduction

The central goal of systems biology is to gain a predictive, system-level understanding of biological networks. This can be done, for example, by inferring causal networks from observations on a perturbed biological system. An ideal experimental design for causal inference is randomized, multifactorial perturbation. The recognition that the genetic variation in a segregating population represents randomized, multifactorial perturbations (Jansen and Nap (2001), Jansen (2003)) gave rise to Systems Genetics (SG), where a segregating or genetically randomized population is genotyped for many DNA variants, and profiled for phenotypes of interest (e.g. disease phenotypes), gene expression, and potentially other ‘omics’ variables (protein expression, metabolomics, DNA methylation, etc.; see Figure 1, taken from Jansen and Nap (2001)). In this challenge we explore the use of Systems Genetics data for elucidating causal network models among genes, i.e. Gene Networks (DREAM5 SYSGEN A), and for predicting complex disease phenotypes (DREAM5 SYSGEN B).

Image:DataTypes.jpeg

DREAM5 SYSGEN A – In-silico network challenge

The goal of “DREAM5 SYSGEN A – In-silico network challenge” is to create models with biological interpretation, i.e. to reverse-engineer Gene Networks (GNs), from Systems Genetics data.

From previous DREAM challenges, especially the DREAM3 [Prill et al. (2010), Marbach et al. (2010)] and DREAM4 In-silico Network Challenges, it has become clear that systematic perturbations (e.g. experimental gene knockouts and knockdowns) and measurements of the responses contribute greatly to establishing the directed structure of GNs. However, large-scale systematic knockouts may be impractical or infeasible for many cell types and even impossible for some organisms. Systems Genetics experiments, as considered here, could provide an alternative. Genetic polymorphisms, which are naturally present in populations, act as multifactorial genetic perturbations that could be used to elucidate causal links between genes. For example, if the mean expression level of gene B differs significantly between two groups of individuals, one carrying one genetic variant of gene A and the other carrying another variant of gene A, this observation strongly suggests a causal regulatory effect of gene A on gene B. Recently, multiple approaches have been applied to Systems Genetics data in order to elucidate GNs. For an excellent introduction to these methods, see Rockman (2008).
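The gene A / gene B example above can be sketched as a two-group comparison. The code below is only an illustration, not a method prescribed by the challenge: it assumes genotypes are coded 0/1 and uses Welch's t statistic, which is our choice for the sketch. The function name and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(genotype_a, expression_b):
    """Welch's t statistic for the expression of gene B, comparing the
    individuals carrying variant 1 of gene A against those carrying variant 0.

    Assumption: genotypes are coded as 0 or 1 (one code per parental allele).
    """
    group0 = [e for g, e in zip(genotype_a, expression_b) if g == 0]
    group1 = [e for g, e in zip(genotype_a, expression_b) if g == 1]
    if len(group0) < 2 or len(group1) < 2:
        raise ValueError("each genotype group needs at least two individuals")
    # Standard error of the difference in means, allowing unequal variances.
    se = sqrt(variance(group0) / len(group0) + variance(group1) / len(group1))
    if se == 0:
        raise ValueError("expression is constant within both groups")
    return (mean(group1) - mean(group0)) / se


# Toy data (made up): eight individuals, four per variant of gene A.
genotype_a = [0, 0, 0, 0, 1, 1, 1, 1]
expression_b = [1.0, 1.2, 0.9, 1.1, 2.0, 2.1, 1.9, 2.2]
t_statistic = welch_t(genotype_a, expression_b)
```

A large absolute value of the statistic indicates that the expression of gene B differs between the two genotype groups of gene A. On its own it does not distinguish a direct regulatory link from an indirect one.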

The Generative Model

Due to the lack of a reliable experimentally determined Gold Standard network, this challenge is based on simulated Systems Genetics data (Liu et al (2008)). Systems Genetics data was simulated with our MATLAB software tool SysGenSIM (Pinna, Soranzo, Hoeschele and de la Fuente, unpublished).

Image:DREAM5_Fig2.JPG
We employ the dynamical model shown in Figure 2, where C_g is the mRNA concentration of gene g, V_g is its basal transcription rate, and δ_g is its degradation rate constant. The velocity of transcription of gene g is a function of the normalized product of the state of its regulators R_g, where C_n is the expression level of input n of gene g, K_gn/Z_n represents the concentration of gene n at which its effect on the transcription rate of gene g is half of its maximum effect, h_gn is a cooperativity coefficient, and A_gn is an element of a matrix A encoding the signed network structure (A_gn = -1 for an inhibitor, A_gn = 1 for an activator, and A_gn = 0 if n has no effect on g). Z_g (cis-effect) and Z_n (trans-effect) are parameters representing the effects of DNA polymorphisms in the model. We set the value of each Z to either 1 or 0.75 depending on the binary genotype value. θ_g^trsc and θ_g^deg represent fluctuations in the transcription and degradation rates, respectively, and are sampled from a normal distribution before the calculation of the steady state. All other parameters remain fixed throughout the generation of a dataset. For simplicity we have set all V_g, K_gn and δ_g to 1. The value of each cooperativity coefficient h_gn is set to 1, 2 or 4 with probabilities 0.6, 0.3 and 0.1, respectively.

In these simulations we consider data from Recombinant Inbred Lines (RILs), i.e. a set of homozygous lines derived from a cross between two genetically diverse inbred parent lines, through inbreeding for multiple generations. Each of these RILs is homozygous for the allele of one of the parents, and each RIL has inherited different combinations of parental alleles: the RILs constitute a genetically randomized population. In other words, the gene expression pattern of each RIL is the result of a different multifactorial genetic perturbation.

We generated genotyping and gene expression data for “in-silico” RIL populations in the following way:

The Data

Important Note: Please let the organizers know if you plan to use the data of subchallenge A in your own publications outside of the DREAM challenge context.

In each of the three subchallenges that constitute the DREAM5 SysGenA challenge, we provide data for networks with 1000 genes in tab-separated values files, as indicated below:

Note: The first row of these files is a header row ("G1" "G2" ... "G1000") that indicates which gene each column refers to. In this way, the j-th column holds the expression levels or the genetic variants of the same gene "Gj". The rows are ordered such that the i-th row after the header row in both the Expression and Genotype files corresponds to data from the same RIL i.
Update July 06 2010: A participant discovered a mistake in the simulated genotyping data. We have fixed the problem and regenerated all data. If you downloaded the data before July 06 2010, please download the new data sets and discard the earlier ones.
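The file layout described above (a header row "G1" ... "G1000" followed by one row per RIL, with matching row order in the Expression and Genotype files) could be read along the following lines. This is a minimal sketch: the function names are ours and the file paths in the usage comment are placeholders, not the actual names of the challenge files.

```python
import csv


def read_sysgen_table(path):
    """Read one tab-separated file: a header row of gene names,
    then one row of numeric values per RIL."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        genes = [name.strip().strip('"') for name in next(reader)]
        rows = [[float(value) for value in row] for row in reader if row]
    for i, row in enumerate(rows, start=1):
        if len(row) != len(genes):
            raise ValueError(f"RIL {i}: expected {len(genes)} values, found {len(row)}")
    return genes, rows


def paired_profiles(expression_path, genotype_path):
    """Load both files and check that they describe the same genes and RILs.

    Row i of `expression` and row i of `genotype` belong to the same RIL;
    column j of both belongs to the same gene.
    """
    genes_e, expression = read_sysgen_table(expression_path)
    genes_g, genotype = read_sysgen_table(genotype_path)
    if genes_e != genes_g:
        raise ValueError("expression and genotype files list different genes")
    if len(expression) != len(genotype):
        raise ValueError("expression and genotype files have different numbers of RILs")
    return genes_e, expression, genotype


# Usage (placeholder paths, substitute the files you downloaded):
# genes, expression, genotype = paired_profiles("expression.tsv", "genotype.tsv")
```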

What do we want to learn from this challenge? This challenge is aimed at identifying the best approaches for Gene Network inference from Systems Genetics data for varying sample sizes, in particular considering the N << p problem, where the number of observations N (number of RILs) is less than the number of variables p (number of genes).

Submission

In order for this challenge to shed light on the performance of the algorithms under different data sizes, participants are strongly encouraged to submit predictions for all three subchallenges. However, predictions for only one or two of the three subchallenges will be accepted.

Participants are required to submit:

1. For each subchallenge submit five files, each containing a ranked list of no more than 100,000 regulatory link predictions ordered according to the confidence you assign to the predictions, from the most reliable (first row) to the least reliable (last row) prediction. Use a three-column, tab-separated format as in the example below:

A \tab B \tab XYZ

where A and B are two different genes (G1, G2,…,G1000). No self-interactions (A \tab A \tab XYZ) will be considered. Links are directed: the gene in the first column regulates the gene in the second column. (If both A regulates B and B regulates A, then both lines should be included separately.) Links are unsigned (positive and negative regulation are both evaluated simply as regulation). XYZ is a score between 0 and 1 that indicates the confidence level you assign to the prediction (e.g., XYZ = 1 if gene A is deemed to regulate gene B with highest confidence and XYZ = 0 if A is deemed not to directly regulate B). All pairs omitted from the list will be considered to appear randomly ordered at the end of the list. Save the file as text, and name it:

DREAM5_TeamName_SubChallenge_Networki.txt

where "TeamName" is the name of the team with which you registered for the challenge, "SubChallenge" is either SysGenA100, SysGenA300, or SysGenA999, and "Networki" is one of the five networks of the indicated subchallenge (Network1, Network2,..., Network5). To participate in a subchallenge, you need to submit predictions for all five networks.

2. Submit a short (one to two page) write-up explaining the methodology used to generate your predictions. Submit the write-up as the file

DREAM5_TeamName_SysGenA_Writeup.ext

replacing "TeamName" with the name of your team and the file extension ("ext") with your choice of txt, doc, rtf, or pdf.
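The prediction file format and naming rules of item 1 can be sketched as follows. This is an illustration under assumptions: the input dictionary, the function names and the six-decimal score formatting are ours, and only the constraints stated above (three tab-separated columns, no self-interactions, scores between 0 and 1, at most 100,000 rows, most confident first) come from the challenge description.

```python
MAX_LINKS = 100_000
SUBCHALLENGES = {"SysGenA100", "SysGenA300", "SysGenA999"}


def ranked_link_lines(scores, max_links=MAX_LINKS):
    """Turn {(regulator, target): confidence} into ranked, tab-separated lines."""
    links = []
    for (regulator, target), score in scores.items():
        if regulator == target:
            continue  # self-interactions are not considered
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {regulator}->{target} is outside [0, 1]")
        links.append((regulator, target, score))
    # Most reliable prediction first; ties keep their original order.
    links.sort(key=lambda link: link[2], reverse=True)
    return [f"{a}\t{b}\t{s:.6f}" for a, b, s in links[:max_links]]


def submission_filename(team, subchallenge, network_index):
    """Build DREAM5_TeamName_SubChallenge_Networki.txt."""
    if subchallenge not in SUBCHALLENGES:
        raise ValueError(f"unknown subchallenge: {subchallenge}")
    if network_index not in range(1, 6):
        raise ValueError("network index must be between 1 and 5")
    return f"DREAM5_{team}_{subchallenge}_Network{network_index}.txt"


# Toy example with made-up scores.
example_lines = ranked_link_lines({
    ("G1", "G2"): 0.90,
    ("G2", "G1"): 0.40,  # the reverse direction is a separate prediction
    ("G3", "G3"): 1.00,  # dropped: self-interaction
    ("G2", "G3"): 0.95,
})
example_name = submission_filename("MyTeam", "SysGenA100", 3)
```

Writing the file is then a matter of joining the lines with newline characters and saving them as text under the generated name.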

Scoring metrics

We will score the results using the area under the Precision versus Recall (PR) curve for the whole set of link predictions for a network. For the first k predictions (ranked by score, and for predictions with the same score, taken in the order they were submitted in the prediction files), precision is defined as the fraction of those k predictions that are correct, and recall is the fraction of all true connections that are among them. The area under the receiver operating characteristic (ROC; http://en.wikipedia.org/wiki/Roc_curve) curve will also be evaluated. For each subchallenge an overall score will be obtained as in previous DREAM In-silico Network challenges. The precise scoring system can be found in Stolovitzky et al. (2009). Teams will be ranked according to their overall performance over the five networks of a subchallenge.
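To make the definitions of precision and recall at rank k concrete, here is a small sketch. It is not the official scoring code: the area is approximated by a step-wise sum (often called average precision), pairs omitted from the list are ignored rather than appended in random order, and the function names and toy network are ours. The exact procedure is the one described in Stolovitzky et al. (2009).

```python
def precision_recall_points(ranked_links, true_links):
    """Return (recall, precision) after each of the first k predictions.

    ranked_links: list of (regulator, target), most confident first.
    true_links:   set of (regulator, target) pairs in the gold standard.
    """
    if not true_links:
        raise ValueError("the gold standard contains no links")
    points = []
    correct = 0
    for k, link in enumerate(ranked_links, start=1):
        if link in true_links:
            correct += 1
        points.append((correct / len(true_links), correct / k))
    return points


def stepwise_pr_area(points):
    """Step-wise area under the PR points: each gain in recall is
    weighted by the precision at the rank where it occurs."""
    area = 0.0
    previous_recall = 0.0
    for recall, precision in points:
        area += (recall - previous_recall) * precision
        previous_recall = recall
    return area


# Toy example: four ranked predictions, two true links.
ranked = [("G1", "G2"), ("G3", "G1"), ("G2", "G3"), ("G1", "G3")]
truth = {("G1", "G2"), ("G2", "G3")}
pr_points = precision_recall_points(ranked, truth)
pr_area = stepwise_pr_area(pr_points)
```

In the toy example the first and third predictions are correct, so precision is 1 at k = 1, drops to 1/2 at k = 2, and recovers to 2/3 at k = 3, where recall reaches 1.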


DREAM5 SYSGEN B – The Systems Genetics of soybean data challenge

The ability to predict complex phenotypes (such as disease susceptibility) from genotyping and/or gene expression is one of the keys that will open the door to personalized medicine. Both types of data are currently collected at a tremendous rate using established technologies such as DNA arrays or with emerging ones such as DNA and RNA next-generation sequencing. The goal of this challenge (see Figure 3) is to obtain models that predict disease phenotypes from:

(i) only genotype data (subchallenge B1),
(ii) only gene-expression data (subchallenge B2), or
(iii) both genotype and gene-expression data (subchallenge B3).

Participants are referred to Chen et al. (2009) for an idea of how to tackle such problems.

Image:DREAM5_Fig3.jpeg

The Data

Important Note: Part of the data of the subchallenge SysGenB was generously provided prior to publication. The datasets of this challenge may not be used for publication without explicit permission of the data owners. Please contact Gustavo Stolovitzky to coordinate with the owners if you plan to use the data for a publication (gustavo@us.ibm.com). Once the owners have published the data, all datasets may be freely used. The reference to cite will be posted here.


The data comes from a Systems Genetics experiment in soybean conducted at the Virginia Bioinformatics Institute (VBI) (data kindly provided by Brett Tyler and colleagues at VBI, Virginia Tech and Ohio State University). In this study two inbred parental lines, differing substantially in susceptibility to a major pathogen, were crossed and their offspring were selfed (inbred) for more than 12 generations to produce a population of Recombinant Inbred Lines (RILs). There is nearly no genetic variation within each RIL but much variation among RILs, because each RIL represents a different combination of the two parental genomes. Each RIL was genotyped for 941 genetic variants, and gene-expression profiled for 28,395 genes. Note that the gene expression data was measured in uninfected plants in order to determine whether disease resistance can be predicted from ‘normal’ gene expression.

Then the plants were infected with a pathogen (Phytophthora sojae) and assayed for two continuous phenotypes related to the severity of infection: ‘percent present’ and ‘scale factor’. Both of these phenotypes are measures of the amount of pathogen RNA in the infected tissue sample and thus are measures of the density of pathogen colonization of the infected tissue. Percent present means the fraction of pathogen probe sets that yield a detectable hybridization signal as determined by the MAS5 presence/absence call in the Affymetrix software used to analyze the data. Scale factor is the ratio of the sum of all the background-subtracted soybean probe intensities to the sum of all the background-subtracted P. sojae probe intensities. Plants that are resistant to the pathogen generally have low values for these phenotypes, while susceptible plants have high values.

The training set consists of data from 200 RILs for which genotype, gene-expression and phenotype information are provided. Three training data files are provided in the file DREAM5_SysGenB_TrainingData.zip, downloadable from the DREAM web site:

Update July 06 2010: At the request of one of the challenge participants, we now also provide files with the IDs of gene expression probes (DREAM5_SysGenB_ExpressionProbeIDs.xls) and genetic markers (DREAM5_SysGenB_GenotypeMarkerIDs.xls).

The Challenge

This challenge is aimed at identifying the best predictive modeling approaches as well as evaluating the benefits of learning from combined genotype and gene-expression data. This challenge is composed of three subchallenges. Participants are encouraged to submit predictions to all of the following three subchallenges. Predictions for individual subchallenges will also be accepted.

The file DREAM5_SysGenB_TestData.zip, which contains the files DREAM5_SysGenB1_TestGenotypeData.txt, DREAM5_SysGenB2_TestExpressionData.txt, DREAM5_SysGenB3_TestGenotypeData.txt and DREAM5_SysGenB3_TestExpressionData.txt, can be downloaded from the DREAM web site.

Submission

Participants are required to submit the values of the phenotype traits and a write-up (mandatory) describing their methods. For the submission of the phenotype predictions, we provide a template file in two formats, DREAM5_SysGenB_Predictions.xls and DREAM5_SysGenB_Predictions.csv, which has the following structure:

RIL1 RIL2 ... RIL30
Phenotype1 PREDICT PREDICT ... PREDICT
Phenotype2 PREDICT PREDICT ... PREDICT

Please submit one file per subchallenge, replacing in the template file "PREDICT" by the predicted phenotype values for each RIL. At submission rename the file to

DREAM5_TeamName_SubChallenge_Predictions.xls
DREAM5_TeamName_SubChallenge_Predictions.csv

where "TeamName" is the name of the team with which you registered for the challenge and "SubChallenge" is either "SysGenB1", "SysGenB2" or "SysGenB3".
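For the .csv variant of the template, replacing the "PREDICT" placeholders can be automated. The sketch below makes two assumptions that should be checked against the actual template file: that the placeholders appear row by row with Phenotype1 before Phenotype2 (as in the structure shown above), and that plain decimal numbers are acceptable. The function name and the miniature template in the example are ours.

```python
import re


def fill_template(template_text, phenotype1, phenotype2):
    """Replace each PREDICT token, in reading order, with a predicted value.

    phenotype1, phenotype2: predicted values ordered RIL1, RIL2, ...
    """
    values = list(phenotype1) + list(phenotype2)
    n_slots = len(re.findall(r"\bPREDICT\b", template_text))
    if n_slots != len(values):
        raise ValueError(f"template has {n_slots} slots but {len(values)} values were given")
    remaining = iter(values)
    # The callable is invoked once per match, left to right.
    return re.sub(r"\bPREDICT\b", lambda match: f"{next(remaining):.6g}", template_text)


# Miniature, made-up template with two RILs instead of the full set.
mini_template = ",RIL1,RIL2\nPhenotype1,PREDICT,PREDICT\nPhenotype2,PREDICT,PREDICT\n"
filled = fill_template(mini_template, [0.1, 0.2], [3, 4])
```

Because only the placeholder tokens are touched, the row and column labels of the template are preserved exactly as provided.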

Write-up. We request that each participating team submits a short write-up (around two to three pages) explaining the methods used to arrive at their predictions of the phenotypes. This write-up, which is mandatory for submission, can contain pseudo-code, workflows, and explanations of the concepts. Submit the write-up as the file

DREAM5_TeamName_SysGenB_Writeup.ext

replacing "TeamName" with the name of your team and the file extension (ext) with your choice of txt, doc, rtf, or pdf. The submission of this write-up is mandatory for participation in this challenge.

Scoring metrics

Each subchallenge will be scored individually. A p-value for the rank correlation between predicted phenotype values and the actual withheld phenotype values will be calculated. Overall scores for each subchallenge are obtained by combining the p-values for the two phenotypes in a single score:

S = -log(p-value1*p-value2),

where p-valuei refers to the p-value for phenotype i predictions. Teams will be ranked based on this score.
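As an illustration of how such a score can be computed, the sketch below uses Spearman's rank correlation with a one-sided permutation p-value and the natural logarithm. All three choices are assumptions made for the example: the text above specifies neither the test used to obtain the p-values nor the base of the logarithm. The function names are ours.

```python
import math
import random


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def permutation_p_value(predicted, observed, n_permutations=2000, seed=0):
    """One-sided p-value: how often a shuffled phenotype correlates
    with the predictions at least as well as the real one."""
    rng = random.Random(seed)
    rho = spearman_rho(predicted, observed)
    shuffled = list(observed)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if spearman_rho(predicted, shuffled) >= rho:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


def combined_score(p_value_1, p_value_2):
    """S = -log(p1 * p2), computed as a sum of logs to avoid underflow."""
    return -(math.log(p_value_1) + math.log(p_value_2))
```

For example, p-values of 0.01 and 0.1 for the two phenotypes give S = -log(0.001), about 6.91 with the natural logarithm. Smaller p-values yield a larger score.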


References/Further reading

Jansen, R.C., and Nap, J.P. (2001) Genetical genomics: the added value from segregation. Trends Genet. 17, 388-391 (source of Figure 1)

Jansen, R.C. (2003) Studying complex biological systems using multifactorial perturbation. Nat. Rev. Gen. 4, 145-151

Liu B, de la Fuente A and Hoeschele I. (2008) Gene network inference via structural equation modeling in genetical genomics experiments. Genetics 178, 1763-1776

Prill RJ, Marbach D, Saez-Rodriguez J, Sorger PK, Alexopoulos LG, Xue X, Clarke ND, Altan-Bonnet G and Stolovitzky G. (2010) Towards a rigorous assessment of systems biology models: the DREAM3 challenges. PLoS One 5(2), e9202

Marbach D, Prill RJ, Schaffter T, Mattiussi C, Floreano D and Stolovitzky G. (2010) Revealing strengths and weaknesses of methods for gene network inference. Proc Natl Acad Sci U S A 107(14), 6286-6291

Rockman, M.V. (2008) Reverse engineering the genotype-phenotype map with natural genetic variation. Nature 456, 738-744

Stolovitzky G, Prill RJ and Califano A. (2009) Lessons from the DREAM2 Challenges, in Stolovitzky G, Kahlem P, Califano A, Eds, Annals of the New York Academy of Sciences, 1158:159-95

Chen BJ, Causton HC, Mancenido D, Goddard NL, Perlstein EO, and Pe'er D. (2009) Harnessing gene expression to identify the genetic basis of drug resistance. Mol Syst Biol 5, 310


Authors

The challenge has been provided by Alberto de la Fuente, Andrea Pinna and Nicola Soranzo from CRS4 Bioinformatica in Sardinia, Italy, and Ina Hoeschele and Brett Tyler from the Virginia Bioinformatics Institute, VA USA. The challenge has been designed in collaboration with Robert Prill and Gustavo Stolovitzky from the IBM T.J. Watson Research Center in New York and Julio Saez-Rodriguez from Harvard and MIT.

Download

Download Data (Registration Required).

Questions and Feedback

Don't hesitate to post a question on the DREAM Discussion board or to contact Alberto de la Fuente (alf@crs4.it) or Gustavo Stolovitzky (gustavo@us.ibm.com) directly if you need any clarification or have a suggestion about this challenge.

Retrieved from "http://wiki.c2b2.columbia.edu/dream/index.php/D5c3"

This page has been accessed 24,761 times. This page was last modified 14:20, 23 December 2010.
