Designing efficient user-friendly biological data management systems.

Zoé Lacroix

doi:10.1089/153623103322006742

Abstract

BIOINFORMATICS can refer to almost any collaborative effort between biologists or geneticists and computer scientists, and thus covers a wide variety of traditional computer science domains, including data modeling, data retrieval, data mining, data integration, data management, data warehousing, data cleaning, ontologies, simulation, parallel computing, agent-based technology, grid computing, and visualization. However, applying each of these domains to biomolecular and biomedical applications raises specific and unexpectedly challenging research issues. The design of a biological data management system relies on the access and exploitation of information related to diseases, disorders, and conditions. This information is available at multiple data sources and requires sophisticated tools for its access and analysis. Biological data are growing at a rate unseen since the earliest days of the field. Gene sequencing robots, new experimental methodologies, and online data collection devices are causing exponential growth in the amount of raw data on the web that is available to life scientists. Life scientists need to exploit these large datasets transparently, using various new applications to analyze, mine, cluster, and visualize this wealth of information. A transparent biological data management system should give life scientists the ability to access data and applications without explicit knowledge of where the data are stored, how they are structured, or where the application is running. Many systems have been developed since the meetings on the Interconnection of Molecular Biology Databases (the first of the series was organized at Stanford University in the San Francisco Bay Area, August 9–12, 1994) and the list of queries of the DOE report on Genome Informatics (Robbins, 1993). Although successful, these systems are often limited and fail to meet all users’ needs, while the needs and the problems to address have become significantly more complex.
During their development and usage, existing approaches have accumulated valuable experience, and the analysis of this past experience should benefit the research community. It is time to step back and gain perspective: to understand the specific problem addressed by each system, why its particular architecture was chosen, and its strengths and weaknesses, and then to evaluate these approaches and provide an overall summary of their characteristics (advantages and disadvantages). The diversity of data sources and the multitude of applications, often distributed on the Internet, raise complex issues related to integration. Traditional integration approaches typically do not address both dimensions of the problem: multi-database systems, mediators, and warehouses are data-driven, whereas agent architectures (CORBA), Web services, and, more recently, grids are application-oriented. New approaches that integrate both data and applications, with the flexibility needed to accommodate life scientists, are still needed. The problem is made more complex by the semantic mismatches between scientific resources. Information about scientific objects (e.g., a sequence, a gene) is typically spread over multiple data sources, each providing a different identifier for the object. A biological data management system must integrate them all and reconcile these different identifiers in order to provide life scientists with transparent access to each scientific object. Existing efforts to formalize keys for scientific objects, data formats, and
