Overview
Project Summary
High-throughput experiments have been increasing the range, diversity and size of biological databases exponentially. A wealth of new techniques and tools for analyzing these often noisy data resources is emerging in parallel. Each of these databases and tools maintains its own architecture, data representation, naming and access syntax, data manipulations and query semantics, as well as its own confidence measures. As a result, the diversity, scale and complexity of the information prevent biologists from using all available knowledge and create barriers between communities, e.g., between molecular and systems biologists. Scientific progress depends on the ability to harness these diverse resources to create new biological knowledge.
This requires the resolution of five fundamental challenges:
(1) Diversity: unify the syntax and semantics of diverse data sources,
(2) Confidence: manage the intrinsic uncertainties of manipulating noisy data,
(3) Scaling: accelerate access and manipulations of vast amounts of distributed data,
(4) Complexity: enable navigation through the growing ocean of data and tools, and
(5) Reuse: enable researchers to systematically share, reuse and build upon each other’s results within and among research communities.
These are serious challenges from both the computational and the biological perspective. The intellectual merit of this proposal lies in the novelty of the approach put forward to resolve them. The solutions will be incorporated into the GeneTegrate server system, and outreach efforts will be pursued to distribute GeneTegrate to the research community.
The key idea underlying GeneTegrate is to hide the diversity of data and tools “under the hood” by creating unified abstractions in an enriched object-relationship semantic layer. Thus, data are accessed and manipulated as attributes of objects; tools are invoked through the objects' methods; navigation is handled by traversing relationships; and confidence measures are managed as object properties. Requests to the semantic layer are translated by adapters, which directly access and invoke diverse data sources and tools on the fly. GeneTegrate addresses the diversity and confidence management challenges by unifying the semantics of data access and manipulation; it accelerates access to vast distributed data through classifier-based indexing and look-ahead caching; and it resolves the complexity and reuse challenges through a generalized object-relationship spreadsheet facility. Thus, GeneTegrate will be a front end that allows seamless access to, integration of, and joint analysis of any biological data (from DNA and protein sequences to microarray and pathway data) taken from any source (database or prediction tool).
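The object-relationship layer described above can be illustrated with a minimal Python sketch. This is not GeneTegrate's actual implementation or API: the class names (`Value`, `Adapter`, `SemanticObject`), the 0-to-1 confidence scale, the relation name `interacts_with`, and the two stand-in sources are all hypothetical and chosen only to show how attributes, adapters, relationships and confidence measures might fit together.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Value:
    """An attribute value paired with the confidence attached to it."""
    data: object
    confidence: float  # assumed scale: 0.0 (unreliable) to 1.0 (certain)


class Adapter:
    """Hypothetical adapter: translates a semantic-layer request for one source."""

    def __init__(self, name: str, fetch: Callable[[str, str], Value]):
        self.name = name
        self._fetch = fetch

    def get(self, object_id: str, attribute: str) -> Value:
        # A real adapter would query a database or run a prediction tool here.
        return self._fetch(object_id, attribute)


@dataclass
class SemanticObject:
    """A biological entity whose attributes are resolved through adapters."""
    object_id: str
    adapters: Dict[str, Adapter]  # attribute name -> adapter that serves it
    relations: Dict[str, List["SemanticObject"]] = field(default_factory=dict)
    _cache: Dict[str, Value] = field(default_factory=dict, init=False, repr=False)

    def attribute(self, name: str) -> Value:
        # Fetch on first access, then reuse the stored value.
        if name not in self._cache:
            self._cache[name] = self.adapters[name].get(self.object_id, name)
        return self._cache[name]

    def related(self, relation: str) -> List["SemanticObject"]:
        # Navigation is a traversal of named relationships.
        return self.relations.get(relation, [])


# Stand-in sources (made-up data, for illustration only).
def fake_sequence_db(object_id: str, attribute: str) -> Value:
    records = {"P1": "MKTLLVAG", "P2": "MSSNQW"}
    return Value(records[object_id], confidence=1.0)


def fake_tm_predictor(object_id: str, attribute: str) -> Value:
    # A prediction tool reports a result together with its own confidence.
    return Value(True, confidence=0.8)


adapters = {
    "sequence": Adapter("sequence_db", fake_sequence_db),
    "is_transmembrane": Adapter("tm_predictor", fake_tm_predictor),
}
protein = SemanticObject("P1", adapters)
partner = SemanticObject("P2", adapters)
protein.relations["interacts_with"] = [partner]

sequence = protein.attribute("sequence")
prediction = protein.attribute("is_transmembrane")
neighbours = protein.related("interacts_with")
```

In this sketch the caller never sees which source answered a request: a database lookup and a prediction tool are both reached through `attribute`, and each result carries its confidence alongside the data.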
The broad impact of GeneTegrate is that it will enable biologists to reexamine old questions with new means, and to formulate new questions that could not be answered without considerable integration. This capability will be demonstrated by applying GeneTegrate to particular biological problems that are hard or impossible to solve without integrating diverse resources: (i) the prediction of B-cell epitopes and the computerized design of specific antibodies, and (ii) the analysis of the role of transmembrane proteins in biological networks.
A proof-of-concept prototype of GeneTegrate has already been completed. The research and development team combines the interdisciplinary skills needed to expand it into a complete, novel solution.
