Project
Project Description
Challenges of Integrated in silico Biology
The advent of in silico biology depends on the ability of biologists to harness the rapidly expanding range of vast biological data resources and analysis techniques. Envision a researcher in the not too distant future pursuing an in silico (i.e. computational) drug design study that attempts to custom-tailor a drug to the genomic makeup of a patient. People with very similar diagnoses often respond very differently to the same treatment, and differences in DNA sequences are believed to be one of the main reasons for this differential response (for a discussion of pharmacogenomics: Jasny 2003). The researcher might apply the following steps:
- Search for differences between gene expression patterns in different populations of patients with the same disease.
- Identify a patient-specific optimal pathway to be attacked by the drug by scanning pathway databases using the genes dominating the expression patterns.
- Identify within each pathway proteins that are potential drug targets, and obtain, from each patient, the sequence of these proteins.
- Predict how the minute differences in sequence between patients (i.e. the different haplotypes) translate to differences in the three-dimensional (3D) structure of the corresponding proteins.
- Use these structural differences on the protein to design/select the drug that optimally manipulates the selected pathway.
Such a study will require the integration and correlation of vast amounts of diverse data as well as of techniques and tools for their analysis. This presents five fundamental challenges:
(i) Diversity: correlating biological data from a large range of diverse sources.
Biological data are distributed among a growing range of diverse databases and data manipulation tools, each having its own architecture, access techniques, data representation, naming convention, semantic organization, and query semantics. Furthermore, this diversity will likely continue and grow with the increasing specialization of the communities that generate datasets, databases, and analysis techniques, or apply them; integration is critical to enable interdisciplinary sharing and the formation of a common base among these communities of interest. In the above drug design example, the databases for pathways, protein structures, and sequences are likely to have been generated by different communities with differing naming and data representation conventions, even for the same protein data. Resources are unlikely to maintain explicit data on the relationships of respective objects; similarly, the analysis tools (e.g., for identifying differential gene expression, an optimal pathway, or significant deviation of conformations) may have their own specialized data representations. The Diversity challenge is to create technologies that unify access to and manipulation of these diverse data sources.
(ii) Confidence: managing intrinsic uncertainties of biological data and techniques.
The data generated in biochemical experiments, particularly by high-throughput methods, are very noisy. This implies that conclusions drawn from the data might be false. To establish the statistical significance of their conclusions, biologists use statistical tests (such as the chi-square test) to calculate confidence levels (typically as P-, E- or Z-values). Similarly, analysis and search tools (e.g., sequence alignments) are based on statistical assumptions and should have confidence values associated with their output. In the above drug design example, the microarray and pathway databases are likely to contain measures of the statistical significance of different data items; similarly, tools for analyzing sequences, structures, pathways and eventually drug targets should also generate measures of confidence for their outputs. While it is generally agreed that confidence measures are a critical component of bioinformatics, there is no agreement on the methods for computing them. The confidence management challenge is to calibrate this diverse set of confidence measures so that confidence levels have the same meaning across databases and tools. The uncertainty reduction challenge is to find methods for combining evidence across databases and methods so as to increase the confidence that can be assigned to computed conclusions.
(iii) Scaling: searching vast amounts of distributed data sources efficiently.
Biological databases often involve vast amounts of data and thus require resource-intensive searches. It is not atypical for searches of a single database to require minutes or even hours. Successive retrieval and correlation of data from multiple distributed databases extend this time significantly. The pathway, protein, gene and structure databases of the above drug design example may be distributed around the globe, and the searches involved may require scanning through enormous amounts of data. The scaling challenge is to create technologies that can scale and accelerate these searches at high sensitivity and selectivity.
(iv) Complexity: navigating and manipulating a complex ocean of data and tools.
It is increasingly difficult for users to navigate and exploit the growing myriad of data sources and analysis tools. The drug design researcher may not be familiar with all the details of the different pathway databases or the tools to analyze changes in structural conformations; meaningful abstractions may hide such details “under the hood”. The complexity challenge is to create technologies that simplify access to and use of an expanding world of resources in bioinformatics.
(v) Reuse: enabling scientists to share, reuse and build on each other’s results.
Scientific advance builds through sharing and reuse of results. In silico studies share results by publishing papers, the code of underlying tools, and the raw data from the analyses and/or respective datasets. In the future drug design study the researcher may wish to reuse the models, analyses and data of previous studies of optimal target pathways for related pathologies. The reuse challenge is to replace current ad hoc techniques with systemic technologies for simple sharing and reuse of results.
The GeneTegrate Solution
We propose the development of novel technologies to resolve these challenges and to incorporate these solutions into a GeneTegrate system/server. The GeneTegrate server will empower biologists to pursue in silico studies of greater complexity by integrating a broad range of molecular and system-level data and tools, while hiding the complexity of these resources “under the hood” of simpler unifying abstractions. Similarly, GeneTegrate will empower experimental and computational biologists to integrate their data and tools through simple adapters facilitating their broad sharing. We will pursue outreach efforts to broadly distribute the GeneTegrate server, initially concentrating on structure and system biologists, and to integrate its development with key standardization efforts. We have already completed a successful proof-of-concept prototype that was presented at ISMB 2005, and that provided a comfortable base to undertake these ambitious goals.
Success of our efforts would mean that GeneTegrate provides the foundations for enabling the pursuit of large-scale in silico research through sharing and integration of data resources and analysis tools. In other words, our major objective will not be the delivery of yet another gargantuan merge-it-all method, but to develop the foundations upon which an independently growing web of resources will be able to flourish.
Integration through unified semantic model
Our strategy is to emulate the fundamentals of relational databases and to generalize these to handle biological resources. Loosely speaking, relational databases separate the logical semantic database layer from the physical syntactic layer of accessing the data. The logical layer uses a Data Definition Language (DDL) to construct schema, which abstract the entity-relationship semantics of the data; it uses a Data Manipulation (query) Language (DML) to provide abstractions that allow the navigation, correlation and manipulation of records in terms of these schema. This logical layer is based on a relational algebra defined by operations on the schema (select, project and join). It enables users to organize and manipulate data in terms of simple abstractions that hide the underlying complexity of the raw physical data.
GeneTegrate, likewise, creates a logical semantic layer that provides unifying DDL/DML abstractions in order to hide the underlying raw databases and tools. This semantic layer is based on object-relationship database semantics extended with calibrated confidence measures (Fig. 1).
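The three relational operations that this layer generalizes (select, project and join) can be sketched in a few lines. This is only an illustration of the standard relational algebra, not GeneTegrate code; the relation names, columns and values are invented for the example.

```python
# Toy relations as lists of dicts; every name and value here is illustrative.
proteins = [
    {"protein_id": "P1", "gene_id": "G1", "length": 120},
    {"protein_id": "P2", "gene_id": "G2", "length": 340},
]
genes = [
    {"gene_id": "G1", "symbol": "geneA"},
    {"gene_id": "G2", "symbol": "geneB"},
]


def select(rows, predicate):
    """Keep the rows for which the predicate holds."""
    return [r for r in rows if predicate(r)]


def project(rows, columns):
    """Keep only the named columns of every row."""
    return [{c: r[c] for c in columns} for r in rows]


def join(left, right, key):
    """Natural join on a single shared column."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


# "Which gene symbols encode proteins longer than 200 residues?"
long_proteins = select(join(proteins, genes, "gene_id"),
                       lambda r: r["length"] > 200)
result = project(long_proteins, ["symbol", "protein_id"])
print(result)
```

The query is phrased entirely in terms of the schema (columns and keys); how the rows are stored or fetched is invisible to it, which is the separation the semantic layer aims to preserve.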
In our futuristic drug design scenario (Section C1), the Modeler (Fig. 1) may include objects (schema) to represent pathways, genes and proteins. Each object in turn includes data attributes (e.g., the sequence and structure of a protein) as well as respective methods (e.g., the microarray object may include methods to identify expression patterns associated with a pathology). Objects additionally include relationship attributes (e.g., a pathway object may include attributes referring to the protein objects <included-in> the pathway). Finally, objects may include confidence measures associated with their data as well as with their relationships and methods. These confidence measures enrich the OR database semantics with the statistical semantics required for handling uncertainty in biological data and computations (the semantics of these confidence measures are discussed in more detail below).
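One way to picture such an object is as a record that carries data attributes, relationship attributes, and a confidence value for each. The sketch below is a hypothetical illustration; the class names `Confident` and `ModelObject`, the relation name `includes`, and all values are assumptions made for the example, not the Modeler's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Confident:
    """A value paired with a confidence in [0, 1] (hypothetical helper)."""
    value: object
    confidence: float


@dataclass
class ModelObject:
    """Hypothetical Modeler object: attributes plus confidence-weighted relationships."""
    kind: str
    name: str
    attributes: dict = field(default_factory=dict)     # attribute name -> Confident
    relationships: dict = field(default_factory=dict)  # relation -> [(object, confidence)]

    def relate(self, relation, other, confidence):
        """Record a relationship to another object with its confidence."""
        self.relationships.setdefault(relation, []).append((other, confidence))

    def related(self, relation, min_confidence=0.0):
        """Traverse a relationship, keeping only sufficiently confident links."""
        return [obj for obj, conf in self.relationships.get(relation, [])
                if conf >= min_confidence]


protein_a = ModelObject("protein", "P1", {"sequence": Confident("MKTAYIAK", 0.99)})
protein_b = ModelObject("protein", "P2", {"sequence": Confident("GSHMLEDP", 0.95)})
pathway = ModelObject("pathway", "PW1")
pathway.relate("includes", protein_a, confidence=0.9)
pathway.relate("includes", protein_b, confidence=0.4)

confident_members = [o.name for o in pathway.related("includes", min_confidence=0.8)]
print(confident_members)
```

Attaching the confidence to the link itself, rather than to either endpoint, lets the same protein belong to several pathways with different degrees of certainty.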
The Modeler is connected to databases and to tools through an Adapter Layer (Fig. 1). Database adapters support unified access to respective databases through conversions of naming, data representations and API; they build upon the work of various organizations to create unified ontologies and XML data representation syntaxes (see Section C4 for details). Application adapters enable the integration of diverse application tools that access, manipulate and share data through a common Modeler API.
For example, in the drug design scenario, access to the protein structure object in the Modeler repository will retrieve data from protein structure databases, using the respective adapter. Invocation of a method to predict the structure of a protein given a mutation in sequence will activate, through a respective adapter, an application tool (e.g. a remote server for structure prediction) to perform the evaluation. This tool, in turn, needs to retrieve sequence and structure data from other resources. These data will be retrieved by traversing the corresponding relationships and databases: (microarray, gene), (gene, pathway), (pathway, gene), (gene, protein), (protein, structure). The Modeler presents the computations in terms of object attribute access, invocation of methods and traversal of relationships; it uses the adapters to transform these schema-level data manipulations into respective raw database retrievals and application tool invocations.
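The role of a database adapter can be sketched as a thin class that translates a unified identifier into a database's native name and converts the raw record into a common representation. Everything below is an assumption for illustration: the interface `DatabaseAdapter`, the class `StructureDBAdapter`, the identifiers and the in-memory "database" are hypothetical stand-ins for a real remote resource.

```python
from abc import ABC, abstractmethod


class DatabaseAdapter(ABC):
    """Hypothetical common interface that the semantic layer would call."""

    @abstractmethod
    def fetch(self, unified_id):
        """Return a record in the unified representation."""


class StructureDBAdapter(DatabaseAdapter):
    """Adapter for an imaginary structure database with its own naming scheme."""

    def __init__(self, records, id_map):
        self._records = records  # native id -> raw record (stands in for a remote DB)
        self._id_map = id_map    # unified id -> native id (naming conversion)

    def fetch(self, unified_id):
        native_id = self._id_map[unified_id]  # convert the name
        raw = self._records[native_id]        # raw database retrieval
        # Convert the native representation into the unified one.
        return {"id": unified_id, "structure": list(raw["coords"])}


raw_structures = {"struct_0001": {"coords": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]}}
adapter = StructureDBAdapter(raw_structures, id_map={"protein:P1": "struct_0001"})
record = adapter.fetch("protein:P1")
print(record)
```

A caller only ever sees `protein:P1` and the unified record; swapping in a different structure database means writing another adapter, not changing the caller.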
This Modeler architecture is based on substantial research of systems that integrate vast and diverse network management data to facilitate their automated analysis carried out by Dr. Yemini and his colleagues (Dupuy 1989, 1991; Wolfson 1991; Yemini 1993, 1994; Goldszmidt 1998; Yemini 2000; Konstantinou 2003b). Of particular relevance is the NESTOR project described comprehensively elsewhere (www.cs.columbia.edu/dcc/nestor).
In summary, GeneTegrate resolves the Diversity challenge by creating semantic layer abstractions to hide the underlying raw databases and tools. Diverse data and tools are accessed through respective object attributes and methods.
Simplicity and reuse of in silico models via Object-Relationship spreadsheets
Resolving the Complexity challenge is essential in enabling biologists to pursue manipulations of complex data required by future in silico biology. The GeneTegrate strategy is to resolve both the Complexity and Reuse challenges by generalizing the familiar spreadsheet.
The Object-Relationship Spreadsheet (ORS) generalizes standard spreadsheets in three ways. First, it replaces the square grid of a spreadsheet with a general graph of the relationship topology. Second, it clusters attribute cells into object cells. Third, it permits cells to admit more general graphics than mere rectangles to enable more intuitive interactions; thus a gene cell and a microarray cell may be rendered differently.
A user interacts with ORS much like a standard spreadsheet: entering data into some cells and formulas into other cells; with the relationship graph links propagating the computations among the cells. The data entry parts typically bind an object cell to a respective Modeler object with computational formula expressed in terms of respective methods (Fig. 2).
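The propagation of computations along relationship links can be sketched as a small dependency graph of cells that is recalculated in topological order. This is a minimal sketch under stated assumptions: the `Sheet` class, the cell names and the toy formulas (a fold-change cutoff of 2.0 and an overlap count) are invented for the drug-design scenario and are not the ORS implementation.

```python
from graphlib import TopologicalSorter


class Sheet:
    """Hypothetical ORS-like sheet: data cells plus formula cells linked in a graph."""

    def __init__(self):
        self.values = {}    # cell name -> current value
        self.formulas = {}  # cell name -> (function, [input cell names])

    def set_value(self, name, value):
        self.values[name] = value

    def set_formula(self, name, func, inputs):
        self.formulas[name] = (func, list(inputs))

    def recalculate(self):
        # Each formula cell depends on its inputs; evaluate inputs first.
        graph = {name: set(inputs) for name, (_, inputs) in self.formulas.items()}
        for name in TopologicalSorter(graph).static_order():
            if name in self.formulas:
                func, inputs = self.formulas[name]
                self.values[name] = func(*(self.values[i] for i in inputs))
        return self.values


sheet = Sheet()
sheet.set_value("expression", {"geneA": 3.1, "geneB": 0.2, "geneC": 2.4})
sheet.set_value("pathways", {"pw1": {"geneA", "geneC"}, "pw2": {"geneB"}})
# Formula cell: genes whose (made-up) fold change exceeds 2.0.
sheet.set_formula("diff_genes",
                  lambda expr: sorted(g for g, fold in expr.items() if fold > 2.0),
                  ["expression"])
# Formula cell: pathway sharing the most genes with the differential set.
sheet.set_formula("target",
                  lambda genes, pws: max(pws, key=lambda p: len(pws[p] & set(genes))),
                  ["diff_genes", "pathways"])
values = sheet.recalculate()
print(values["diff_genes"], values["target"])
```

In this sketch a circular reference between formula cells makes `TopologicalSorter` raise `graphlib.CycleError`, so recursion is not supported.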
Consider again the futuristic drug-design scenario (Section C1): the user may first query a microarray database to identify significant genes associated with the pathology; these genes and/or their products are then used to identify an optimal pathway to be attacked by the drug. Fig. 2 illustrates an ORS computation of such an optimal target pathway.
Several notes are appropriate.
The spreadsheet process organizes the flow of computations in terms of the object-relationship graph and the data and methods maintained by the Modeler. Thus the user can focus on the unified semantic structure of the data, ignoring its details. In reality, the microarray database may use completely different names and representations of genes than the database providing the proteins expressed by these genes, and the pathway database may have yet different names and representations of the same proteins. These details are entirely hidden from the user, who formulates the computations at an abstract level.
In silico computations often need to manipulate uncertain data and to support user decisions according to the level of confidence in the data. In the process above, the selection of differentiating genes by the method “=MicroarrayDB.Diff” and of target pathways by “=PathwayDB.Target(ProteinDB)” reflects such uncertainties (Fig. 2). A user may wish that computations of “=MicroarrayDB.Diff” retrieve only genes that are differentially expressed at a given confidence level. The next section discusses technologies to accomplish this.
ORS models are primarily constructed through point-and-click on cell objects. Cell objects may be viewed in terms of multiple graphical renderings. A protein secondary structure object may be viewed as a sequence of atoms and coordinates, a sequence of amino acids and labels, packed spheres, sticks, or bands. A user may vary these views as best fits the task. Methods are selected from respective object cell menus and invoked through standard spreadsheet “=”. The spreadsheet of Fig. 2 may be entirely constructed using under 20 point-and-click operations.
ORS models may reuse and build on each other, much like current spreadsheets do. For example, the method “=PathwayDB.Target(ProteinDB)” may be defined by a different spreadsheet and/or use data from another spreadsheet through appropriate referencing. Furthermore, spreadsheets are incorporated as Modeler objects, enabling users to organize, browse and retrieve spreadsheet models as needed. Thus, complex in silico computational models may be simplified through hierarchical organization and composition of spreadsheets, much like standard spreadsheets do.
ORS spreadsheets may be used to program a broad range of services and utilities to manage in silico computations. For example, a spreadsheet can provide publish-subscribe services that update a local object cache whenever a respective object changes. This is accomplished by simply applying the method “=remote_object_reference” of the cached cell to retrieve the remote origin object.
Spreadsheet computations have their limitations too, particularly in handling recursion. We will investigate these potential limitations and methods to best address them. Furthermore, the Modeler API admits broader data manipulation layer mechanisms. This API may be used to develop alternative computational paradigms for simple reusable in silico models.
In summary, GeneTegrate uses a generalization of spreadsheet computations as its base paradigm for in silico computations. This Object-Relationship Spreadsheet (ORS) tool resolves the Complexity and Reuse challenges through familiar spreadsheet mechanisms readily usable by many users with very different backgrounds.
Management of confidence measures
The entire variety of data attributes, relationships and methods, encapsulated in the framework of GeneTegrate, is described by varying confidence levels. For example, microarray data is often noisy and features of protein structure may have been computed by a prediction method with intrinsic uncertainty. Modeler objects include attributes to represent this uncertainty and methods to evaluate confidence in respective data. This is considered in more detail below.
Current research in computational biology often measures confidence through the statistical methodology of hypothesis testing, which evaluates the confidence that we have in a new hypothesis by comparing the probability that the new hypothesis assigns to the data with the probability that a null hypothesis assigns to the data. This methodology is well studied in statistics and yields estimates that are known as P-values, E-values, and Z-values. The main problem with this methodology is that the “correct” way of associating distributions with the hypotheses is often unclear. The null hypothesis is especially problematic because, on the one hand, it is supposed to capture everything that we know about the data a priori (i.e. before we conduct the experiment), and on the other hand, it has to be simple to compute in order to be useful. This problem is particularly severe in modern biology because the complexity of the experiments makes it very hard to suggest a reasonable null hypothesis. Consequently, the confidence levels assigned to conclusions are often dubious.
Some of these shortcomings of hypothesis testing are overcome by an alternative methodology that is already commonly used: the empirical testing methodology. In order to use this methodology, we need a gold standard set, i.e. a set of objects for which we have a high level of confidence as to their correct labeling (highly reliable experimental measurements). For example, for the task of predicting protein structure we have access to very high-resolution structures (<1.5Å) with accurate temperature factors (B-values). For such data, we can therefore evaluate the confidence of predictions by comparison with such experiments.
One fruitful way of using empirical tests is to calibrate confidence levels calculated using a hypothesis testing method that generates P-values. The idea is very simple. We order the elements of the gold standard set by increasing P-values. We then construct a calibration graph that maps P-values to their accuracy on the gold standard set. Assuming that the gold standard set is sufficiently large and representative, this calibration graph can be used to map any new prediction P-value to its normalized value. Such calibration will be one of the fundamental operations in GeneTegrate. The calibrated confidence values will be added to the database and will provide spreadsheet users with a simple uniform notion of confidence that they can use in their analysis. While this methodology is clearly sound, it also has problems. Most importantly, we often lack a sufficiently large, accurate, and representative gold standard. With the increase in the sensitivity and accuracy of experimental and analytical methods, however, the severity of this problem may decrease in the future. In fact, one of the goals of this project is to collect, grow, and disseminate gold standard sets, and more importantly to facilitate the construction of such sets.
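The calibration graph described above can be sketched as follows. The gold standard is a list of (P-value, correct?) pairs; it is sorted by increasing P-value, cut into equal-sized bins, and each bin records the fraction of correct predictions. The binning scheme, the function names and the toy numbers are assumptions chosen for illustration; the text does not prescribe how the graph is constructed or smoothed.

```python
from bisect import bisect_left


def build_calibration(gold, n_bins=4):
    """Build a calibration table from (p_value, is_correct) pairs.

    Returns a list of (upper_p_value, accuracy) pairs, one per bin,
    ordered by increasing P-value.
    """
    ordered = sorted(gold)                   # order by increasing P-value
    size = max(1, len(ordered) // n_bins)    # items per bin
    table = []
    for start in range(0, len(ordered), size):
        chunk = ordered[start:start + size]
        accuracy = sum(1 for _, ok in chunk if ok) / len(chunk)
        table.append((chunk[-1][0], accuracy))
    return table


def calibrate(p_value, table):
    """Map a new P-value to the accuracy observed in its bin of the gold standard."""
    uppers = [upper for upper, _ in table]
    idx = min(bisect_left(uppers, p_value), len(table) - 1)
    return table[idx][1]


# Toy gold standard: eight predictions with known outcomes (made-up numbers).
gold = [(0.001, True), (0.002, True), (0.01, True), (0.02, False),
        (0.05, True), (0.1, False), (0.3, False), (0.5, False)]
table = build_calibration(gold, n_bins=4)
print(table)
print(calibrate(0.0015, table), calibrate(0.015, table), calibrate(0.9, table))
```

Two tools whose raw P-values are not comparable can each be passed through their own table, after which both report the same quantity: empirical accuracy on the gold standard. With a gold standard this small the bins are far too coarse, which is the sample-size problem noted in the text.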
An area of computer science that has been based on empirical testing is machine learning. In this field, gold standard data are referred to as “labeled data”, which are divided (either randomly or by some particular criterion) into separate sets used for training and testing. The goal of machine learning is to predict the labels of the test set, which are treated as unknown during training. To do so, model parameters are fitted to the training set. There are many well-studied machine learning algorithms, including neural networks, decision tree algorithms, support vector machines and boosting. Boosting is particularly useful in the context of confidence measures. It combines many different features, each of which is only slightly correlated with the value of interest, into a single predictor that is highly accurate. One of the surprising properties of AdaBoost (the most popular boosting algorithm) is that it can find a good combination even if the training set is relatively small. We plan to use AdaBoost to combine prediction algorithms in a way that achieves higher accuracy and sensitivity than any one of them alone. This approach was already applied by Freund and collaborators: they integrated data from gene promoter sequences, microarray experiments, and the Gene Ontology (GO) database to create models for regulatory gene networks in yeast (Middendorf et al. 2004; Middendorf et al. 2005).
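To make the idea of combining weak predictors concrete, here is a compact sketch of textbook AdaBoost with one-feature threshold rules ("decision stumps") as the weak learners. The choice of stumps, the function names and the toy data are assumptions for illustration; this is not the configuration used in the cited studies.

```python
import math


def train_adaboost(X, y, rounds=3):
    """Textbook AdaBoost. X: list of feature vectors; y: labels in {-1, +1}."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble = []  # entries: (alpha, feature index, threshold, polarity)
    for _ in range(rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for j in range(len(X[0])):
            for thr in sorted({x[j] for x in X}):
                for pol in (1, -1):
                    preds = [pol if x[j] >= thr else -pol for x in X]
                    err = sum(w for w, p, t in zip(weights, preds, y) if p != t)
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol, preds)
        err, j, thr, pol, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)    # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # vote of this stump
        ensemble.append((alpha, j, thr, pol))
        # Increase the weight of misclassified examples, then renormalize.
        weights = [w * math.exp(-alpha * p * t)
                   for w, p, t in zip(weights, preds, y)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(a * (pol if x[j] >= thr else -pol) for a, j, thr, pol in ensemble)
    return 1 if score >= 0 else -1


# Toy 1-D data on which no single threshold separates the two classes.
X = [[float(i)] for i in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
single = train_adaboost(X, y, rounds=1)
boosted = train_adaboost(X, y, rounds=3)
print(sum(predict(single, x) != t for x, t in zip(X, y)),
      sum(predict(boosted, x) != t for x, t in zip(X, y)))
```

On this toy set the best single stump misclassifies some points, whereas the weighted vote of three stumps fits all ten training points.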
Scale through smart indexing
GeneTegrate addresses the Scaling challenge by incorporating Smart Indexing technologies to accelerate searches in databases of biological sequences and Look-ahead Caching to accelerate distributed data access.
Smart Indexing constructs indices to the occurrences of sequence motifs. These motifs correspond to sequence segments that are evolutionarily highly conserved and thus may be useful as anchor-points for performing sequence alignments. Freund and Ie (a GRA who will be supported by the requested funds) are currently working on a smart indexing system that is based on the BLOCKS protein motif database (Henikoff & Henikoff 1996; Pietrokovski et al. 1996; Henikoff et al. 2000).
Smart indexing is useful, for example, for the analysis of protein sequences. The first step of such an analysis is to identify sequences that are similar (homologous) to the sequence of interest (the query sequence). Homologous sequences are likely to share some functional and structural similarities with the query sequence. Careful comparison of homologous sequences (phylogenomics) can reveal evolutionary relations. In order to perform an analysis of this type we need a protein “search engine”. The most popular search engines for identifying protein homologues are BLAST (Altschul & Gish 1996) and its successor PSI-BLAST (Altschul et al. 1997). The run-time of these engines scales linearly with the size of the protein database. As protein databases are growing rapidly, these search engines either need more CPU resources or slow down. It usually takes 5-10 minutes to run five iterations of PSI-BLAST on a single protein. The run-time of a search using smart indexing increases much more slowly with the number of database entries. In fact, we anticipate that smart indexing will allow searching through databases many times faster at increased selectivity and sensitivity.
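The principle of index-based search can be sketched with an inverted index from short sequence words to the sequences that contain them. This is a deliberately simplified stand-in: exact k-letter words replace the conserved BLOCKS motifs, and the function names, sequences and thresholds are invented for the example.

```python
from collections import defaultdict


def build_index(database, k=3):
    """Map every k-letter word to the set of sequence ids that contain it."""
    index = defaultdict(set)
    for seq_id, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index


def candidates(query, index, k=3, min_hits=2):
    """Return ids of sequences sharing at least min_hits distinct words with the query."""
    hits = defaultdict(int)
    for word in {query[i:i + k] for i in range(len(query) - k + 1)}:
        for seq_id in index.get(word, ()):
            hits[seq_id] += 1
    return sorted(s for s, n in hits.items() if n >= min_hits)


# Toy protein "database" (made-up sequences).
database = {
    "s1": "MKTAYIAKQR",
    "s2": "GSHMKTAYLA",
    "s3": "PPPPGGGGWW",
}
index = build_index(database)
found = candidates("MKTAYI", index)
print(found)
```

The index is built once; each query then touches only the entries for its own words instead of scanning every sequence, and the resulting candidates can be passed to a full alignment step.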
Look-ahead Caching combines two techniques: caching of retrieved data at local GeneTegrate servers to accelerate future searches, and pre-fetching of data that a given computational process is anticipated to need.
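Both ingredients can be sketched together in a small cache class: retrieved items are stored locally, and each access also loads the items that a predictor expects to be needed next. The class `LookAheadCache`, the keys and the predictor are hypothetical, and prefetching is done synchronously here for simplicity, whereas a real system would presumably fetch in the background.

```python
class LookAheadCache:
    """Hypothetical local cache that also prefetches anticipated items."""

    def __init__(self, fetch, predict_next):
        self._fetch = fetch                # function: key -> value (slow remote access)
        self._predict_next = predict_next  # function: key -> keys likely needed next
        self._store = {}
        self.remote_calls = 0

    def _load(self, key):
        # Go to the remote source only on a cache miss.
        if key not in self._store:
            self._store[key] = self._fetch(key)
            self.remote_calls += 1
        return self._store[key]

    def get(self, key):
        value = self._load(key)
        for nxt in self._predict_next(key):  # look ahead along known relationships
            self._load(nxt)
        return value


# Toy remote source and predictor: a gene lookup is usually followed by its protein.
remote = {"gene:G1": "ATGAAA", "protein:P1": "MK"}
cache = LookAheadCache(
    fetch=remote.__getitem__,
    predict_next=lambda key: ["protein:P1"] if key == "gene:G1" else [],
)
gene = cache.get("gene:G1")         # miss: fetches the gene and prefetches the protein
calls_after_first = cache.remote_calls
protein = cache.get("protein:P1")   # hit: already prefetched
print(gene, protein, calls_after_first, cache.remote_calls)
```

In this sketch the predictor is a hand-written rule; in the setting described above it could follow relationship links such as (gene, protein).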
Application to biology
We propose the application of GeneTegrate to study two specific biological problems, namely (1) the prediction of B-cell epitopes and the in silico design of antibodies, and (2) the analysis of transmembrane proteins in protein networks. Both objectives will require the development of completely new methods and tools in computational biology. They have in common only that they both require the combination of various resources. On the one hand, GeneTegrate will allow the realization of solutions to both problems that would not exist without GeneTegrate. On the other hand, both problems will drive the development of GeneTegrate. A third task will benefit from the efficient integration provided by GeneTegrate but will not involve the development of novel methods, namely the inference of protein function from protein structure in the context of structural genomics.
B-cell epitopes and in silico design of antibodies
The first problem will be the analysis and prediction of B-cell epitopes and the computerized design of antibodies that can bind them. B-cell epitopes are sites on the protein surface that are specifically recognized by antibodies (Van Regenmortel 1992; Hansson et al. 2000). Identifying the regions in a protein that account for its immunogenicity is one key to understanding the mechanism of the immune response. Antigen-antibody complexes have also long been used as a model for understanding the general phenomenon of molecular recognition (Jones & Thornton 1997; Lo Conte et al. 1999; Chen et al. 2003). A better understanding of such complexes could, therefore, also shed light on protein interaction in general. Of more practical importance is another aspect of the method that we propose to develop: most detailed experiments in molecular biology begin with the design of an antibody that binds specific epitopes on proteins of interest (Goldman 2000; Hansson et al. 2000; Ellis 2001). In fact, progress is often hampered by the difficulties and costs in generating specific antibodies. Thus, a method that successfully predicts interactions between proteins and antibodies could become an essential component of accelerating progress in experimental biology.
The application of bioinformatics to immunology has recently gained attention (Chakraborty et al. 2003; Lund 2005). Of particular interest in this context is the prediction of antigenicity and antigenic response. Various methods attempt to predict the residues in a protein that constitute B-cell epitopes. Unfortunately, the performance of these methods does not yet suffice to make a difference (Blythe & Flower 2005). One limitation of existing methods is that none of them uses all available knowledge to generate its predictions. To identify and analyze all available knowledge, many biological databases have to be mined efficiently, and elaborate analyses and comparisons of the retrieved data have to be performed using many different tools. For example, 3D structures of complexes between proteins and antibodies are available from structural databases. Identifying the epitopes in these complexes requires the retrieval of the complexes, which requires a similarity-based query of the database, and the analysis of the structure, sequence and biophysics of both the antibodies and the antigen using various computational tools (such as multiple structure alignment and tools for the identification of binding sites). Sequence databases of epitopes from various experiments contain additional data, but require a completely different set of tools. Integrating all these data into one single unified system is currently impractical. Using GeneTegrate we plan to automate this process and generate the most comprehensive knowledge base of B-cell epitopes and the antibodies that bind them. We will then use this knowledge to predict which parts of a protein are antigenic and to predict the sequence of an antibody that could bind them.
Role of transmembrane proteins in protein networks
The second problem that we will address through GeneTegrate is related to the prediction of function for genes and proteins that have no previous annotations. Millions of new genes and proteins have been sequenced in recent years. Most of them have not been studied in the lab, and we have no idea what their biological function may be (Rost et al. 2003). For example, for over 1,400 of the proteins for which we have information as detailed as 3D structures, we have no experimental clue whatsoever about function. This constitutes almost 5% of all known structures, an amazingly high fraction, in particular considering that the determination of a single protein structure costs about $50K, which translates to substantial efforts in wet labs. Since experimental annotations are lagging so far behind the pace at which sequences accumulate, in silico annotations may provide the only insight into function. One of the two main approaches is sequence/structure-based (reviews: Rost et al. 2003; Whisstock & Lesk 2003); it relies on insights from biophysics, molecular biology and structural biology. The other is context-based; it relies on the analysis of networks and/or of high-throughput data (Kelley 2003). GeneTegrate offers a unique opportunity to combine these two approaches, as illustrated in the screenshot from the GeneTegrate prototype introduced at ISMB 2005 (Fig. 3). As another example, using sequence and structure prediction methods such as those included in the PredictProtein server (Rost et al. 2004), GeneTegrate can identify the functionally un-annotated proteins that are likely to cross membranes. Some of these “integral transmembrane” proteins constitute the communication channels between the different compartments of the cell, or between different cells. It is believed that 20-30% of all proteins have transmembrane regions (Liu & Rost 2001; Bigelow et al. 2004; Daley et al. 2005).
Sequence/structure-based molecular tools can quite accurately predict whether a certain protein is inserted into the membrane or not. Integrating these results with the knowledge from other sequence/structure-based tools that identify, for example, binding sites for small molecules, and/or that identify more distantly related homologues could give a more specific idea regarding the function of these proteins. Since membrane proteins play a role in transmitting information between different compartments in the cell, it is reasonable to expect that they will be strategically located in biological networks. By querying microarray and pathway databases we will use GeneTegrate to study the network characteristics of predicted and known transmembrane proteins and try to identify commonalities between the roles that they play in connecting different components of the network. Finally, we will use the network and the molecular data in order to more comprehensively characterize transmembrane proteins. This may facilitate and refine annotations and guesses of their function.
Application to structural genomics
The type of integration enabled by GeneTegrate will be relevant to a wide range of fields in experimental and computational biology. The above examples (Sections C3.1, C3.2) are just two particular research endeavors that we will pursue for this project. Another application of the GeneTegrate framework will provide a direct connection to many experimental biologists: the Rost lab is extensively involved in structural genomics (Liu et al. 2004). Structural genomics is a term that refers to high-throughput three-dimensional (3D) structure determination and analysis of biological macromolecules (at this stage, primarily individual protein domains). One goal is the determination of a representative structure for each major protein family. A major future challenge for bioinformatics in structural genomics will be the exploitation of experimental structures. In the past, it had often been assumed that the determination of a high-resolution protein structure would provide the best path to unraveling the function of this protein. Structural genomics has demonstrated how incorrect this assumption really was. We need novel and better methods that pave the way from structure to function (Eisenberg et al. 2000; Moult & Melamud 2000; Skolnick & Fetrow 2000; Thornton et al. 2000; Hurley et al. 2002). The development of such methods is currently not funded by the ten new consortia that have just (Jul 2005) been announced by the Protein Structure Initiative (PSI, NIGMS, NIH). Instead of automated methods, structural genomics consortia currently employ expert analysts who laboriously combine many resources to manually annotate function for every single structure determined (Goldsmith-Fischman & Honig 2003; Laskowski et al. 2003). GeneTegrate will provide an excellent framework for simultaneously making many resources available to the experimental labs that determine structures.
We will make our GeneTegrate versions available to at least four of the ten centers (NESG, NYCOMBS, NYSGRC, and the new center in Buffalo) with which the Rost lab has a closer affiliation. Thereby, we will tap into an immediate large user base that will help improve GeneTegrate at all stages.
Previous work
The basic problem that GeneTegrate addresses has been recognized by many groups for many years. The PI first encountered it while at EMBL, in a group that attempted to collate what grew into one of the first meta methods (GeneQuiz) in response to the challenges of what was then the first “large-scale” sequencing effort (Bork et al. 1992). Many companies have spent considerable effort toward this end; a particular example was BioScout, an annual license for which sold for over one million dollars (again the PI was tangentially involved with the development of that software). GeneTegrate relies on many of the previous attempts and is complementary to others. Some of today’s projects toward future integration are still preliminary. Nevertheless, we hope to benefit from their fruits as they mature. Due to the immense body of work, we cannot fully review previous work. Instead, we focus on the major concepts. Most existing integration projects fall into one or more of the following categories:
Data aggregation and federation: efforts aimed at collecting and/or incorporating multiple existing data sources with the objective of creating a new data repository that includes new data sets derived from the integrated data. Biozon (Yona & Kedem 2005) is one recent representative of this concept; it pre-compiles data from molecular databases and from other sources of information about cells and organisms, and creates a graph of relationships between biological and data objects. Efficiency and level of control are amongst the many arguments for compiling such pre-stored snapshots. One of the serious drawbacks, however, is sociological: since the producers of the original data are not fully visible in the final result (the database record from Biozon), the community of those who will contribute their data and methods is likely to remain rather confined (unless the system dominates the world).
GeneTegrate, in contrast, will not create pre-stored static images; it will integrate all resources dynamically. Those who subscribe to the basic framework will incorporate their tools into one simple data model. End-users will then be able to tap into any biological source “on the fly”. One appealing consequence is that all users will be aware of the origin of a particular source at all times. However, GeneTegrate will also use the data aggregation model for a limited task, namely in order to improve the cross-referencing between databases.
Data source integration and automation: these projects aim at accessing a plethora of analytical services and database interfaces. By registering bioinformatics services in directory servers and by using popular interfaces such as the Simple Object Access Protocol (SOAP) (Box 2000) to interface with these services, such efforts streamline the interaction with the available data sources and allow automated programs to interact with and collect data from computational methods. The myGrid (Stevens et al. 2003) consortium, sponsored by the European Bioinformatics Institute (EBI, England), provides a host of tools and services that facilitate the integration of biological data sources. The Taverna (Oinn et al. 2004) workgroup has been developing a workflow construction apparatus under myGrid that enables scientists to easily assemble and enact biological data workflows. The Object Relationship Spreadsheet (ORS, Fig. 2) mechanism that we propose (C2.2) will build on the achievements in workflow construction by the Taverna group.
Overall, the above project types are complementary to GeneTegrate. GeneTegrate will incorporate the more successful projects, thereby enhancing its ability to integrate data. However, these projects usually focus on either data or services, and they do not offer an integrative data model that treats all biological data, from nucleotides to pathways.
Data standardization and cross-standardization: attempts to determine or establish standards for bioinformatics data by agreement (e.g. Semantic Web for Life Sciences), i.e. through the use of published meta-data schemas or ontologies. These endeavors allow automated agents to read and “understand” the semantics of the data, thereby simplifying data exchange between two or more data sources. Obviously, we will adopt important standards in order to speed up integration. However, standardization is a nascent technology and its adoption by the computational biology community continues to be rather slow. One recent, promising example of standardization is the BioPax (www.biopax.org) group, which is creating a data exchange format for biological pathway data by constructing an OWL (www.w3.org/TR/owl-features/) based ontology. The BioPax project nurtures a community of researchers in order to achieve an agreement among major contributors to the pathway database community in constructing this format. Another example is the successful exploitation of the Resource Description Framework (RDF) by the most important protein sequence database, UniProt (Apweiler et al. 2004).
The success of these types of attempts will be proportional to their acceptance by the community. We will make an effort to embed successful standards as early as possible. We are currently collaborating with the BioPax initiative and hope to be able to fully adopt their standards by the end of 2006.
Data visualization and navigation: attempts to build a browser or a work environment that includes many bioinformatics tools (e.g. the BioScout from LION Biosciences Ltd.). A prominent example of the integration of databases and related resources is the Sequence Retrieval System (SRS: Etzold et al. 1996), which may be the only resource at this moment still able to parse most, if not all, major databases. Another academic attempt at creating such an environment is the caWorkbench (amdec-bioinfo.cu-genome.org/caWorchBench.htm), which develops a comprehensive data visualization and navigation environment focused on microarray data within the framework of the cancer Biomedical Informatics Grid (caBIG). This kind of solution is necessarily limited in the number of tools that are integrated, and it easily lags behind the cutting-edge tools in the field (unless integrated by an army of programmers). Thus, this kind of approach is of limited practicality, in particular for the academic environment.
GeneTegrate, in contrast, attempts to establish a front end that is comprehensive and dynamic. By comprehensive we mean that it covers various types of data: DNA/RNA/protein sequence/structure, data from networks and pathways of molecules and cells, and data from microarrays and functional genomics (e.g. 2D gels, mass spec, TAP, and RNAi). By dynamic we mean that GeneTegrate will be able to offer new databases and services shortly after their introduction. This will be possible because GeneTegrate will simply plug in the original resources without requiring any sort of “re-implementation”.
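The "plug in" idea described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each resource is wrapped by a small adapter that returns objects in one shared data model; the names `Record`, `Registry`, `plug_in`, and the two toy adapters are hypothetical and are not taken from the GeneTegrate prototype.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Record:
    """Hypothetical common data-model object; `source` keeps the origin visible."""
    identifier: str
    kind: str          # e.g. "protein_sequence" or "pathway"
    payload: dict
    source: str


# An adapter is any callable that turns a query term into a list of Records.
Adapter = Callable[[str], List[Record]]


@dataclass
class Registry:
    adapters: Dict[str, Adapter] = field(default_factory=dict)

    def plug_in(self, name: str, adapter: Adapter) -> None:
        """Register a resource at run time without touching the existing ones."""
        self.adapters[name] = adapter

    def query(self, term: str) -> List[Record]:
        """Forward the query to every registered resource and merge the results."""
        results: List[Record] = []
        for adapter in self.adapters.values():
            results.extend(adapter(term))
        return results


# Stand-in adapters; real ones would wrap calls to the original remote resource.
def toy_sequence_db(term: str) -> List[Record]:
    return [Record(term, "protein_sequence", {"sequence": "MKTAYIAK"}, "toy_sequence_db")]


def toy_pathway_db(term: str) -> List[Record]:
    return [Record(term, "pathway", {"name": "example pathway"}, "toy_pathway_db")]


registry = Registry()
registry.plug_in("sequences", toy_sequence_db)
hits_before = registry.query("P12345")
registry.plug_in("pathways", toy_pathway_db)   # a new resource added "on the fly"
hits_after = registry.query("P12345")
```

In this sketch, adding a resource only requires writing an adapter; the query code and the other adapters stay unchanged, and every returned record still names the resource it came from.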
Team background
The Rost lab, leading the project, has been developing and exporting tools successfully for over a decade. Most of these methods required the development of novel algorithms for the analysis of protein sequences, and the prediction of protein structure and function. These include the PHD and PROF methods (prediction of secondary structure (Rost & Sander 1993; Rost 1996, 2005), solvent accessibility (Rost & Sander 1994; Rost 1996, 2005), inter-residue contacts (Punta & Rost 2005), membrane helices (Rost 1996) and membrane strands (Bigelow et al. 2004)), CHOP (Liu & Rost 2004b) and CHOPnet (identification of structural domains) (Liu & Rost 2004a), NORSp (prediction of natively disordered proteins) (Liu & Rost 2003), DSSPcont (continuous secondary structure assignment) (Andersen et al. 2002), AGAPE (fold recognition method for proteins of unknown structure) (Przybylski & Rost 2004), ISIS (prediction of residues that stabilize protein-protein and protein-DNA interactions) (Ofran & Rost 2003), and LOCtree (combination of methods that predict subcellular localization) (Nair & Rost 2005). The Rost lab has also pioneered making methods for molecular biology publicly available. For instance, PredictProtein was the first and continues to be the most widely used Web server for structure prediction (Rost et al. 1994; Rost et al. 2004), EVA is a server for and a database with a dynamic, comprehensive automatic evaluation of structure prediction servers (Eyrich et al. 2001), and META-PP (Eyrich et al. 2001; Eyrich & Rost 2003) was one of the first single-page interfaces to state-of-the-art prediction servers. The Rost lab has also taken a leading role in what increasingly marks successful techniques in computational biology, namely methods that build on a plethora of other algorithms.
For example, a successful prediction of the most important sites in protein-protein interactions (hot spots) is not possible without the application of a dozen other methods and at least four major databases. We begin to realize that the ease in a comprehensive combination of methods may become the bottleneck for further improvement of prediction techniques. GeneTegrate has begun a path that may not only allow us to do today what we want to do but may also provide options in the future that will become essential.
The Yemini lab has developed and exported several generations of tools, based on a Modeler architecture similar to that of GeneTegrate, for semantic integration of diverse network management data and analysis (Dupuy 1989, 1991; Wolfson 1991; Hegering et al. 1993; Yemini 1994; Goldszmidt 1998; Yemini 2000; Konstantinou 2003b). Network components are instrumented to collect large and diverse operational datasets (e.g., a typical router may have some 10k schema). A central challenge of network management, pioneered by the Yemini lab, is to integrate this diverse raw data, synthesizing higher-level network knowledge and actions to diagnose global failure behaviors and to optimize configurations and performance. The technical challenges involved are not dissimilar to those pursued by GeneTegrate. Seminal technologies and tools created by Prof. Yemini's lab were widely exported and commercialized by several companies, including a spin-off company (www.smarts.com), recognized as the leader in automating network fault management. The work pursued by the NESTOR project (www.cs.Columbia.edu/dcc/nestor, Yemini 2000; Konstantinou 2003b) is of particular relevance to GeneTegrate; NESTOR introduced a particularly powerful Modeler and the Object Spreadsheet Language (OSL) on which GeneTegrate is building (Konstantinou 2003a).
Prof. Freund is a leading authority in the area of machine learning and statistics. The best-known work is on the AdaBoost algorithm (Freund & Schapire 1996), for which Drs. Freund and Schapire were awarded the 2003 Gödel prize and the 2004 Paris Kanellakis prize. Other contributions to computational learning theory have been to online learning (Freund 2003), adaptive strategies for game theory (Freund & Schapire 1999; Freund & Opper 2002), and the analysis of Bayesian procedures (Freund 2004). In the last two years, Prof. Freund has devoted most research effort to bioinformatics. Collaborations with Christina Leslie and Chris Wiggins (both Columbia) yielded analyses of gene regulation in yeast (Middendorf et al. 2004; Middendorf 2005) and an application of Support Vector Machines (SVMs) to protein classification (Kuang et al., in press). An ongoing collaboration with the Rost lab is focusing on a variety of problems in the structural analysis of proteins.
GeneTegrate was conceived over a year ago and has been developed by our team ever since. It is a synergistic effort of three research groups with complementary expertise and experience. This combination facilitates the understanding of different aspects of the problem and enables the development of a sophisticated solution that is custom-tailored to biological resources but relies on the most recent accomplishments in other fields. Guy Yachdav, the senior programmer of the Rost group, and Yanay Ofran, a postdoc who developed some of the analysis and prediction methods in PredictProtein, led the development of a first proof-of-concept GeneTegrate prototype. They worked with the Yemini group and with graduate students on writing the preliminary data model, the user interface, and the APIs and adapters for interacting with remote services and databases. Earlier this year, Dr. Eyal Mozes, a joint postdoc in the Rost and Yemini groups, joined the team. The first prototype was completed in June 2005; a demo of this prototype (Fig. 3) was presented at the largest international meeting in computational biology, ISMB 2005 in Detroit; the demo was well received and yielded numerous requests for the code. Currently, we are working with the Freund group on improving the searching and indexing of the system.
Research and outreach plan
Plan for outreach and sustainability
The goals of the outreach plan are:
(i) to provide an open distribution of GeneTegrate to academia,
(ii) to create a user community,
(iii) to integrate with complementary and standardization efforts in the community, and
(iv) to transition toward the end of funding into a self-sustaining program.
We will pursue goal (i) through incremental downloadable distributions of GeneTegrate, starting in Q1 2007 (the next section provides a more detailed schedule of development and releases). We will pursue goal (ii) by first building a core user community, based on the current community of PredictProtein (Rost et al. 2004) users. This community, estimated at 10k stable repeat users, is currently generating some 200k queries a year. Our initial releases will target the integration needs of this community, expanding to support broader communities through subsequent incremental releases and web training material. We will seek expansion in the use of GeneTegrate by supporting community publishing/repository of ORS spreadsheet analyses (Q1 2008) and through incorporating GeneTegrate and the ORS into introductory bioinformatics and biology courses (Q1 2009).
We plan to organize a GeneTegrate User Group, starting Q4 2008, to formalize the GeneTegrate support. We plan to pursue (iii), starting Q1 2006, by seeking to integrate the results of and to support the efforts of, amongst others, the GO, SBML, Pathway DB, and UniProt integration groups. We also plan to actively recruit and support integration of databases and tools by other groups as part of the GeneTegrate distribution.
Finally, we plan to pursue (iv), starting Q1 2009, with the goal of reaching self-sustaining distribution by the end of the 5th year. The assumption is that we will have completed all major development for GeneTegrate and that at that point the system will indeed stand alone in the sense that we will only have to provide a clearing house and some basic services. Other than that, we hope that GeneTegrate will by then have succeeded at being implemented and developed further by most of the groups that develop the essential resources in computational biology. We will fund the minimal clearing house functionality through usage licensing fees for commercial users (academic users will always obtain free use licenses), subject to our institutional guidelines (below). Assuming a fee of $200 per year per user, a nominal target population of about 1,000 commercial users will provide some $200k a year, sufficient to provide initial support. As soon as justified by the level of funding and by university policies, we will consider spinning off this support through either not-for-profit or for-profit organizations.
The self-sustainability plan mentioned above will be designed to comply with the following guidelines imposed by our institutions: Software and data authored under this project will be made widely accessible in a manner consistent with university policies. Current practices allow for free access under university copyrights to use, copy, modify and distribute software and data for research, academic, and non-profit purposes via web-based distribution, while commercial licensing is done following university policies.
Research plan and milestones
We propose to pursue the following main tasks to develop the GeneTegrate technologies and to maximize their impact in conducting complex, large-scale in silico analyses.
Task 1: Modeler development: This task will focus on developing the Modeler repository, adapters and the Biological Modeling Language. Our goal is to have a beta release in Q1 2007; this release will integrate all major sequence, sequence annotation, structure and motif databases; it will support sequence analysis, structure prediction, function prediction and contact prediction tools. We plan a first distribution release, targeting the PredictProtein user community, in Q4 2007; subsequent work will expand the range of databases and tools integrated by GeneTegrate, including databases and tools for microarrays, pathways and networks, going through two more releases towards a full release, targeting the broad community of biology researchers, in Q2-4 2009.
Task 2: Confidence management: The core software for performing this task exists and has been used in publications (Middendorf et al. 2004; Middendorf 2005). There are currently two implementations of the algorithm, one in Java (MLJAVA) and one in Matlab. By Q3 2006, we plan to integrate the two versions into a new version of MLJAVA that will be released as open source. By Q1 2007, we plan to integrate MLJAVA with Modeler.
Task 3: Smart Indexing: This task will focus on creating an efficient search engine for protein sequences. The goal is to provide the same functionality as PSI-BLAST with higher accuracy and in a fraction of the time. The basic idea is to create a “smart index”, that is, an index of all occurrences of highly conserved motifs. We have finished developing a beta version of the indexing software and are planning to release it by Q1 2006.
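The general idea of an index over motif occurrences can be illustrated with a minimal sketch. It assumes that the motifs are given as exact strings and that a query is screened by exact motif matches; the actual indexing software is not described here and may work quite differently. The function names `build_index` and `candidates`, and the toy sequences and motifs, are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_index(sequences: Dict[str, str],
                motifs: Iterable[str]) -> Dict[str, List[Tuple[str, int]]]:
    """Map each motif to every (sequence id, start position) where it occurs."""
    index: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    motif_list = list(motifs)
    for seq_id, seq in sequences.items():
        for motif in motif_list:
            start = seq.find(motif)
            while start != -1:
                index[motif].append((seq_id, start))
                start = seq.find(motif, start + 1)   # allow overlapping hits
    return dict(index)


def candidates(query: str, index: Dict[str, List[Tuple[str, int]]]) -> Set[str]:
    """Return ids of sequences that share at least one indexed motif with the query."""
    hits: Set[str] = set()
    for motif, occurrences in index.items():
        if motif in query:
            hits.update(seq_id for seq_id, _ in occurrences)
    return hits


# Toy data, invented for illustration only.
toy_sequences = {
    "seqA": "MKTGSGKSTLLAG",
    "seqB": "AAGSGKSTPPHEXXH",
    "seqC": "MMMMHEXXHQQ",
}
toy_motifs = ["GSGKST", "HEXXH"]

toy_index = build_index(toy_sequences, toy_motifs)
shortlist = candidates("QQGSGKSTQQ", toy_index)
```

The point of such an index is that the expensive comparison only needs to run against the shortlist rather than against the whole database; how the motifs are chosen and how inexact matches are handled are left open in this sketch.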
Task 4: Object-Relationships Spreadsheet (ORS): This task will focus on developing the ORS client and integrating it with the Modeler and application tools. Our goal is to have a beta version by Q1 2007 and a first release by Q3 2007. Thereafter, we will expand the ORS to include facilities for sharing and reuse and for extensibility of the spreadsheet model.
Task 5: Predict B-cell epitopes and design specific antibodies: This task will progress as different sequence and structure databases and services are integrated into GeneTegrate. The full integration of the relevant services will be completed by Q2 2006. Thus, data collection will start by Q3 of that year. Within a year, we will be able to predict the epitopes on proteins. The next phase, namely the computerized design of antibodies specific to these epitopes, will be completed by Q2 2009.
Task 6: Analysis of the network role of transmembrane proteins: This task will be based on all the integrated data sources (sequence, structure, microarray and pathways). Hence it will progress hand in hand with the integration. By Q4 2007, we will have some function prediction integration capabilities in place and we will use them to analyze un-annotated structures. These analyses will serve to refine the function prediction system. By Q3 2007, as pathway analysis tools are fully integrated, we will complete data collection and analysis for this project.
Task 7: Outreach: This task will focus on open distribution of GeneTegrate and on the creation and expansion of an active users community (see C6.1).
References Cited
- Altschul SF & Gish W (1996) Local alignment statistics. Methods in Enzymology 266:460-480.
- Altschul SF, Madden TL, Schaeffer AA, Zhang J, Zhang Z, Miller W and Lipman DJ (1997) Gapped Blast and PSI-Blast: a new generation of protein database search programs. Nucleic Acids Research 25:3389-3402.
- Andersen CAF, Palmer AG, Brunak S and Rost B (2002) Continuum secondary structure captures protein flexibility. Structure 10:175-184.
- Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N and Yeh LS (2004) UniProt: the Universal Protein knowledgebase. Nucleic Acids Res 32:D115-119.
- Bigelow H, Petrey D, Liu J, Przybylski D and Rost B (2004) Prediction of transmembrane beta-barrels for entire proteomes. Nucleic Acids Research 32:2566-2577.
- Blythe MJ & Flower DR (2005) Benchmarking B cell epitope prediction: underperformance of existing methods. Protein Sci 14:246-248.
- Bork P, Ouzounis C, Sander C, Scharf M, Schneider R and Sonnhammer E (1992) What's in a genome? Nature 358:287.
- Box D, Ehnebuske D, Kakivaya G, Layman A, Mendelsohn N, Nielsen HF, Thatte S and Winer D (2000) Simple Object Access Protocol (SOAP) 1.1. http://www.w3.org/TR/2000/NOTE-SOAP-20000508/.
- Chakraborty AK, Dustin ML and Shaw AS (2003) In silico models for cellular and molecular immunology: successes, promises and challenges. Nat Immunol 4:933-936.
- Chen R, Mintseris J, Janin J and Weng Z (2003) A protein-protein docking benchmark. Proteins 52:88-91.
- Daley DO, Rapp M, Granseth E, Melen K, Drew D and von Heijne G (2005) Global topology analysis of the Escherichia coli inner membrane proteome. Science 308:1321-1323.
- Dupuy A, Schwartz J, Sengupta S and Yemini Y (1989) An Object-Oriented Model for Network Management. In: Horowitz E (eds). Object-Oriented Databases and Applications. pp.
- Dupuy A, Sengupta S, Wolfson O and Yemini Y (1991) NetMate: A Network Management Environment. IEEE Network.
- Eisenberg D, Marcotte EM, Xenarios I and Yeates TO (2000) Protein function in the post-genomic era. Nature 405:823-826.
- Ellis RW (2001) Technologies for the design, discovery, formulation and administration of vaccines. Vaccine 19:2681-2687.
- Etzold T, Ulyanov A and Argos P (1996) SRS: Information retrieval system for molecular biology data banks. Methods in Enzymology 266:114-128.
- Eyrich V, Martí-Renom MA, Przybylski D, Fiser A, Pazos F, Valencia A, Sali A and Rost B (2001) EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics 17:1242-1243.
- Eyrich VA & Rost B (2003) META-PP: single interface to crucial prediction servers. Nucleic Acids Research 31:3308-3310.
- Freund Y & Schapire RE (1996) Experiments with a new boosting algorithm. In: (eds). Machine Learning: Proceedings of the Thirteenth International Conference. pp 148-156.
- Freund Y & Schapire RE (1999) Adaptive game playing using multiplicative weights. Games and Economic Behavior 29:79-103.
- Freund Y & Opper M (2002) Drifting games and Brownian motion. Journal of Computer and System Sciences 64:113-132.
- Freund Y (2003) Predicting a binary sequence almost as well as the optimal biased coin. Information and Computation 182:73-94.
- Freund Y, Mansour Y and Schapire RE (2004) Generalization bounds for averaged classifiers. The Annals of Statistics 32:1698-1722.
- Goldman RD (2000) Antibodies: indispensable tools for biomedical research. Trends Biochem Sci 25:593-595.
- Goldsmith-Fischman S & Honig B (2003) Structural genomics: computational methods for structure analysis. Protein Sci 12:1813-1821.
- Goldszmidt G & Yemini Y (1998) Delegated Agents for Distributed System Management. IEEE Communications.
- Hansson M, Nygren PA and Stahl S (2000) Design and production of recombinant subunit vaccines. Biotechnol Appl Biochem 32 (Pt 2):95-107.
- Hegering H-G, Yemini Y, IEEE Communications Society. Committee on Network Operations & Management. and Institute for Educational Services (San Francisco Calif.) (1993) Integrated network management, III : proceedings of the IFIP TC6/WG6.6 third International Symposium on Integrated Network Management : with participation of the IEEE Communications Society CNOM and with support from the Institute for Educational Services, San Francisco, California, USA, 18-23 April, 1993. Amsterdam ; New York: North-Holland.
- Henikoff JG & Henikoff S (1996) Blocks database and its applications. Methods in Enzymology 266:88-104.
- Henikoff JG, Greene EA, Pietrokovski S and Henikoff S (2000) Increased coverage of protein families with the blocks database servers. Nucleic Acids Research 28:228-230.
- Hurley JH, Anderson DE, Beach B, Canagarajah B, Ho YS, Jones E, Miller G, Misra S, Pearson M, Saidi L, Suer S, Trievel R and Tsujishita Y (2002) Structural genomics and signaling domains. Trends Biochem Sci 27:48-53.
- Jasny BR & Roberts L (2003) Are We There Yet? Science 302:587.
- Jones S & Thornton JM (1997) Prediction of protein-protein interaction sites using patch analysis. J Mol Biol 272:133-143.
- Kelley BP, Sharan R, et al. (2003) Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proc Natl Acad Sci U S A 100:11394-11399.
- Konstantinou AV (2003a) Towards autonomic networks. Columbia University www.cs.columbia.edu/dcc/nestor/thesis/.
- Konstantinou AV & Yemini Y (2003b) Programming Systems for Autonomy. In: (eds). IEEE Autonomic Computing Workshop, Active Middleware Services (AMS 2003). Seattle, WA: pp.
- Kuang R, Ie E, Wang K, Wang K, Siddiqi M, Freund Y and Leslie C. Profile-based string kernels for detection of remote homologs and discriminative motifs. Journal of Bioinformatics and Computational Biology. In press.
- Laskowski RA, Watson JD and Thornton JM (2003) From protein structure to biochemical function? J Struct Funct Genomics 4:167-177.
- Liu J & Rost B (2001) Comparing function and structure between entire proteomes. Protein Sci 10:1970-1979.
- Liu J & Rost B (2003) NORSp: predictions of long regions without regular secondary structure. Nucleic Acids Research 31:3833-3835.
- Liu J, Hegyi H, Acton TB, Montelione GT and Rost B (2004) Automatic target selection for structural genomics on eukaryotes. Proteins: Structure, Function, and Bioinformatics 56:188-200.
- Liu J & Rost B (2004a) Sequence-based prediction of protein domains. Nucleic Acids Research 32:3522-3530.
- Liu J & Rost B (2004b) CHOP proteins into structural domains. Proteins: Structure, Function, and Bioinformatics 55:678-688.
- Lo Conte L, Chothia C and Janin J (1999) The atomic structure of protein-protein recognition sites. J Mol Biol 285:2177-2198.
- Lund O, Nielsen M, et al. (2005) Immunological Bioinformatics. MIT Press.
- Middendorf M, Kundaje A, Wiggins C, Freund Y and Leslie C (2004) Predicting genetic regulatory response using classification. Bioinformatics 20 Suppl 1:I232-I240.
- Middendorf M, Ziv E and Wiggins CH (2005) Inferring network mechanisms: the Drosophila melanogaster protein interaction network. Proc Natl Acad Sci U S A 102:3192-3197.
- Middendorf M, Kundaje A, Shah M, Freund Y, Wiggins C and Leslie C (2005) Motif discovery through predictive modeling of gene regulation. In: (eds). RECOMB. pp.
- Moult J & Melamud E (2000) From fold to function. Curr Opin Str Biol 10:384-389.
- Nair R & Rost B (2005) Mimicking cellular sorting improves prediction of subcellular localization. Journal of Molecular Biology 348:85-100.
- Ofran Y & Rost B (2003) Predict protein-protein interaction sites from local sequence information. FEBS Letters 544:236-239.
- Oinn T, Addis M, Ferris J, Marvin D, Senger M, Greenwood M, Carver T, Glover K, Pocock MR, Wipat A and Li P (2004) Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics 20:3045-3054.
- Pietrokovski S, Henikoff JG and Henikoff S (1996) The Blocks database- a system for protein classification. Nucleic Acids Research 24:197-201.
- Przybylski D & Rost B (2004) Improving fold recognition without folds. Journal of Molecular Biology 341:255-269.
- Punta M & Rost B (2005) PROFcon: novel prediction of long-range contacts. Bioinformatics submitted Jan 2005.
- Rost B & Sander C (1993) Prediction of protein secondary structure at better than 70% accuracy. Journal of Molecular Biology 232:584-599.
- Rost B & Sander C (1994) Conservation and prediction of solvent accessibility in protein families. Proteins: Structure, Function, and Genetics 20:216-226.
- Rost B, Sander C and Schneider R (1994) PHD - an automatic server for protein secondary structure prediction. CABIOS 10:53-60.
- Rost B (1996) PHD: predicting one-dimensional protein structure by profile based neural networks. Methods in Enzymology 266:525-539.
- Rost B, Liu J, Nair R, Wrzeszczynski KO and Ofran Y (2003) Automatic prediction of protein function. Cell Mol Life Sci 60:2637-2650.
- Rost B, Yachdav G and Liu J (2004) The PredictProtein server. Nucleic Acids Res 32:W321-326.
- Rost B (2005) How to use protein 1D structure predicted by PROFphd. In: Walker JE (eds). The Proteomics Protocols Handbook. Totowa NJ: Humana, pp 875-901.
- Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B and Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13:2498-2504.
- Skolnick J & Fetrow JS (2000) From genes to protein structure and function: novel applications of computational approaches in the genomic era. Trends Biotechnol 18:34-39.
- Stevens RD, Robinson AJ and Goble CA (2003) myGrid: personalised bioinformatics on the information grid. Bioinformatics 19 Suppl 1:i302-304.
- Thornton JM, Todd AE, Milburn D, Borkakoti N and Orengo CA (2000) From structure to function: approaches and limitations. Nat Struct Biol 7 Suppl:991-994.
- Van Regenmortel MHV (1992) Structure of antigens. CRC Press.
- Whisstock JC & Lesk AM (2003) Prediction of protein function from protein sequence and structure. Q Rev Biophys 36:307-340.
- Wolfson O, Sengupta S and Yemini Y (1991) Managing Communication Networks by Monitoring Databases. IEEE Transactions on Software Engineering.
- Yemini Y (1994) A Comparative Critical Survey of Network Management Protocol Standards. In: Aidarous S, Plevyack T (eds). Network Management into the 21-st Century. IEEE Press, pp.
- Yemini Y, Dupuy S, Kliger S and Yemini S (1993) Modeling The Semantics of Managed Systems. In: (eds). Second IEEE Workshop on Network Management. pp.
- Yemini Y, Konstantinou AV and Florissi D (2000) NESTOR: An Architecture for Network Self-Management and Organization. IEEE Journal on Selected Areas in Communications 18:758-766.
- Yona G & Kedem K (2005) The URMS-RMS hybrid algorithm for fast and sensitive local protein structure alignment. J Comput Biol 12:12-32.



