Querying and mining biological databases.

Ambuj K Singh

doi:10.1089/153623103322006472

Abstract

Thanks to international sequencing efforts, genome datasets have been growing exponentially in the past few years. The GenBank database, for example, has doubled every 15 months. With such rapid growth, biological datasets (sequence, structure, expression array, pathway) have become too large to be readily accessible for homology searches, mining, adequate modeling, and understanding. Much of the important information in this enormous and rapidly growing gold mine will be wasted if we do not develop proper tools to access and analyze it efficiently. The growth in genomic information has spurred increased interest in large-scale comparison of genetic strings. Comparative genomics analyzes and compares the genetic material of different species to identify genes and predict their functions. Such genome analysis involves comparison of strings as large as the whole genome of a species. Phylogenetics and evolutionary studies are other important applications that use the complete genetic information of different species. Shotgun assembly of a genome requires rapid identification of overlaps across millions of reads. The assembly of a large genome can take months on a single machine, the most expensive part being the identification of overlaps. At this scale, new approaches to rapid sequence alignment are needed. Like the sequence databases, the protein structure databases, the expression array databases, and the pathway databases have also been growing significantly. These databases are intrinsically different from sequence databases. For example, in the case of protein structures, common queries ask for the best alignment (in terms of root mean square distance) of a given query protein to a set of target database proteins.
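As a rough illustration of the structure query just described, the sketch below scores a query against each target by root mean square distance and keeps the closest one. It is a simplification under stated assumptions, not a method from the article: the coordinates are assumed to be already superposed with a known one-to-one residue correspondence (finding the optimal superposition is the expensive step and is not shown), and the names `rmsd` and `best_match` are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square distance between two equal-length lists of (x, y, z) points.

    Assumes the two structures are already superposed and that point i in
    coords_a corresponds to point i in coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(coords_a))


def best_match(query, database):
    """Linear scan over a dict of name -> coordinates.

    Returns (name, distance) for the closest target of the same length,
    or None if no target has the same length as the query.
    """
    best = None
    for name, target in database.items():
        if len(target) != len(query):
            continue  # this simplified sketch only compares equal-length structures
        d = rmsd(query, target)
        if best is None or d < best[1]:
            best = (name, d)
    return best
```

The scan touches every target once, so its cost grows linearly with the size of the database; this is the kind of exhaustive comparison that index structures aim to avoid.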
The desired alignment can be either global (i.e., using the whole query, say for the construction of evolutionary trees or classification) or local (i.e., using parts of the query, say in order to find the active sites). Computing the best alignment is an expensive step if it has to be repeated for all the approximately 20,000 structures in PDB, or for the larger set of predicted structures. Structure comparison defines the conserved core of a protein family by isolating the common ancestry of proteins. This allows one to go beyond the twilight zone, where similarities cannot be detected reliably using sequence alone. Predicting the function of proteins is of great benefit since it is faster and cheaper than experimentation. Characterization and understanding of protein structures are important for the identification of functional motifs and for understanding the principles underlying the structure and dynamics of proteins. Just as the sequence and structure databases require the design of new techniques to access, manipulate, and mine datasets, the pathway databases require the design of new techniques for accessing, comparing, and manipulating large graphs. Significant semantics are attached to the nodes (substrates, products) and edges (enzymes, reaction control) of such graphs. The growing amount of biological data and metadata, along with novel distance metrics, requires the design of new index structures and search tools. There is also a need to identify common motifs such as modules in the constructed pathways, and to make predictions based on them. Models, languages, and logics for understanding, simulating, and analyzing pathways are also required to keep pace with the growing amount of data. It is evident that the explosive growth in the size and complexity of biological data is on a collision course with current database query techniques, and presents new challenges to biological database design.
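The pathway graphs described above, with substrates and products as nodes and enzymes as edge labels, might be represented along the following lines. This is a minimal sketch for illustration only: the `Pathway` class and its methods are hypothetical names, reaction control is omitted, and the two glycolysis steps are sample data rather than content from any particular pathway database.

```python
from collections import defaultdict, deque


class Pathway:
    """Directed graph: nodes are compounds, edges are labeled with enzymes."""

    def __init__(self):
        # substrate -> list of (enzyme, product) pairs
        self.reactions = defaultdict(list)

    def add_reaction(self, substrate, enzyme, product):
        self.reactions[substrate].append((enzyme, product))

    def reachable(self, start):
        """Breadth-first search: compounds other than start that can be produced from it."""
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for _enzyme, product in self.reactions.get(node, []):
                if product not in seen:
                    seen.add(product)
                    queue.append(product)
        seen.discard(start)
        return seen


# Sample data: the first two steps of glycolysis.
pathway = Pathway()
pathway.add_reaction("glucose", "hexokinase", "glucose-6-phosphate")
pathway.add_reaction(
    "glucose-6-phosphate", "phosphoglucose isomerase", "fructose-6-phosphate"
)
downstream = pathway.reachable("glucose")
```

Keeping the enzyme on the edge rather than discarding it is what lets later queries compare or filter pathways by the semantics attached to each reaction.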
The new generation of databases has to (a) encompass terabytes of data, often local and proprietary; (b) answer
