Abstract

BackgroundModern, high-throughput biological experiments generate copious, heterogeneous, interconnected data sets. Research is dynamic, with frequently changing protocols, techniques, instruments, and file formats. Because of these factors, systems designed to manage and integrate modern biological data sets often end up as large, unwieldy databases that become difficult to maintain or evolve. The novel rule-based approach of the Ultra-Structure design methodology presents a potential solution to this problem. By representing both data and processes as formal rules within a database, an Ultra-Structure system constitutes a flexible framework that enables users to explicitly store domain knowledge in both a machine- and human-readable form. End users themselves can change the system's capabilities without programmer intervention, simply by altering database contents; no computer code or schemas need be modified. This provides flexibility in adapting to change, and allows integration of disparate, heterogenous data sets within a small core set of database tables, facilitating joint analysis and visualization without becoming unwieldy. Here, we examine the application of Ultra-Structure to our ongoing research program for the integration of large proteomic and genomic data sets (proteogenomic mapping).ResultsWe transitioned our proteogenomic mapping information system from a traditional entity-relationship design to one based on Ultra-Structure. Our system integrates tandem mass spectrum data, genomic annotation sets, and spectrum/peptide mappings, all within a small, general framework implemented within a standard relational database system. General software procedures driven by user-modifiable rules can perform tasks such as logical deduction and location-based computations. The system is not tied specifically to proteogenomic research, but is rather designed to accommodate virtually any kind of biological research.ConclusionWe find Ultra-Structure offers substantial benefits for biological information systems, the largest being the integration of diverse information sources into a common framework. This facilitates systems biology research by integrating data from disparate high-throughput techniques. It also enables us to readily incorporate new data types, sources, and domain knowledge with no change to the database structure or associated computer code. Ultra-Structure may be a significant step towards solving the hard problem of data management and integration in the systems biology era.

Highlights

  • Modern, high-throughput biological experiments generate copious, heterogeneous, interconnected data sets

  • At the beginning of this project, we developed the UltraStructure system while maintaining a traditional ERmodeled database for the project

  • The latter was in place until it became clear that the Ultra-Structure system would reach sufficient utility for day-to-day use

Read more

Summary

Introduction

High-throughput biological experiments generate copious, heterogeneous, interconnected data sets. With frequently changing protocols, techniques, instruments, and file formats Because of these factors, systems designed to manage and integrate modern biological data sets often end up as large, unwieldy databases that become difficult to maintain or evolve. There are a large and growing number of publicly accessible repositories for biological information, the largest being GenBank, UniProt, Ensembl, and the UCSC Genome Browser Each of these resources may contribute critical knowledge or data toward solving a biological research problem, but integrating their diverse structures and data types into a unified analysis remains very difficult. The approach is used to reveal alternatively spliced or frameshifted translation products that cannot be readily found by standard proteomic analysis methods This project involves the melding of several data sources that are large and heterogenous: MS-based proteomic data, genome sequences, gene annotation sets, SNP sets, and more. Multiple spectral datasets are used, and they are available in a variety of formats depending on the source

Objectives
Methods
Results
Discussion
Conclusion
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call