THE RECENT PUBLICATION of a draft of the entire human genome (McPherson et al., 2001; Venter et al., 2001) has served to fuel an already explosive area of research in bioinformatics that is involved in deriving meaningful knowledge from proteins and DNA sequences (Alberts et al., 2002). Even with the full human genome sequence now in hand, scientists still face the challenges of determining exact gene locations and functions, observing interactions between proteins in complex molecular machines, and learning the structure and function of proteins, just to name a few. The progress of this scientific research is closely connected to research in the database community, in that analyzing large volumes of biological data involves maintaining and querying large databases (Moussouni et al., 1999; Davidson, 2002). Database management systems (DBMSs) could help support life sciences applications in a number of different ways. A partial list of tasks that such applications require is: querying large structured databases (such as sequence and graph databases), querying semi-structured data (such as published manuscripts), managing data replication, querying distributed data sources, and managing parallelism in high-throughput bioinformatics. Unfortunately, current DBMSs have largely ignored life sciences applications, and consequently, life sciences researchers have been forced to write their own tools and scripts to perform these tasks. An interesting parallel can be drawn between the state of data management tools in the life sciences today and the state of data management tools for business applications, such as banking, about three decades ago. Prior to the advent of the relational data model, business data was managed and queried using customized programs and scripts that were developed for each application.
Reusing programs, and the algorithms for querying the data, involved rewriting application programs and logic, which was very time-consuming and expensive. In addition, the querying programs were closely tied to the format that was used to represent the data. Any change in the format of the data representation would often break the querying programs. Furthermore, writing complex queries, such as querying over multiple data sets or posing complex analytical queries, was a daunting task. One of the critical contributions of the relational data model (Codd, 1970) was the introduction of a declarative querying paradigm for business data management, in place of the previously used procedural paradigm. In a declarative querying paradigm, the user expresses the query in a high-level language, like SQL, and the DBMS determines the best strategy for evaluating the query. In addition, the DBMS presents to the user only a logical view of the data, against which queries are posed. The physical representation of the data, either on disk or in memory, can be very different from the logical view. For example, in a relational database management system (RDBMS), indices may be created, but the user doesn’t have to query against the index. The user still queries against logical relations, and the system automatically determines whether it is faster to use the indices to answer a query. The user is thus insulated from worrying about details such as the physical organization of data on disk, the exact location of the data, tuning the representation for better performance, and choosing the best plan for evaluating a query. This declarative querying paradigm has been a huge success for relational DBMSs, and today commercial RDBMSs manage terabytes of data and allow very complex querying on these databases. Database management systems can provide similar benefits to the life sciences community, just as they did three decades ago for the business data management community.
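The separation between logical view and physical representation can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module purely as a convenient stand-in for an RDBMS. The table name `sequences`, its columns, the index name `idx_length`, and the helper `run_demo` are hypothetical and chosen only for illustration; none of them come from the systems discussed here. The same SQL text is issued before and after an index is created, and only the plan chosen by the system changes.

```python
import sqlite3


def run_demo():
    """Run one declarative query before and after adding an index.

    All table, column, and index names are illustrative assumptions.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sequences ("
        "id INTEGER PRIMARY KEY, organism TEXT, length INTEGER)"
    )
    # Synthetic rows: ids 0..999, with length = 100 + id.
    rows = [(i, "human" if i % 2 == 0 else "mouse", 100 + i)
            for i in range(1000)]
    conn.executemany("INSERT INTO sequences VALUES (?, ?, ?)", rows)

    # The user states WHAT is wanted, not HOW to find it.
    query = "SELECT id FROM sequences WHERE length = 600"

    def plan(sql):
        # The last column of each EXPLAIN QUERY PLAN row is a
        # human-readable description of the chosen strategy.
        cursor = conn.execute("EXPLAIN QUERY PLAN " + sql)
        return " ".join(str(row[3]) for row in cursor)

    result_before = conn.execute(query).fetchall()
    plan_before = plan(query)

    # Change the physical representation; the query text is untouched.
    conn.execute("CREATE INDEX idx_length ON sequences(length)")

    result_after = conn.execute(query).fetchall()
    plan_after = plan(query)

    conn.close()
    return result_before, result_after, plan_before, plan_after


if __name__ == "__main__":
    before, after, plan_b, plan_a = run_demo()
    print("same answer:", before == after)
    print("plan without index:", plan_b)
    print("plan with index:   ", plan_a)
```

The query string never mentions the index, yet the system is free to use it once it exists. This is the insulation from physical details described above, shown here on a toy scale.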
Many of the data sets that are used in life sciences are growing at an astonishing rate (such as sequence data at NCBI’s GenBank (NCBI, 2002)), and the queries