Editorial: Special Section on High-Performance Computational Biology

S. Aluru, D.A. Bader, N.M. Amato

doi:10.1109/tpds.2006.102

Abstract

Over the past decade, computational molecular biology has grown into a mature discipline with a well-defined body of core knowledge and participation from a large and diverse group of researchers. To keep pace with the explosive growth in research in this field, a number of high-quality journals and annual conferences have been established. Many universities are actively building academic programs, research centers, and groups in computational biology. As a reflection of the maturing of the field, numerous textbooks on computational biology and its various subtopics have been written in recent years, and undergraduate programs are underway. Despite this progress, computational biology continues to be a vibrant discipline with many outstanding research problems and potential for new avenues of investigation for decades to come.

We broadly view high-performance computational biology as the development and application of high-performance computing techniques for extending the reach or scale of investigations in computational biology. A major component of this is the development of parallel and distributed algorithms, and of programming environments and systems for aiding biological investigations using high-performance parallel computers, grid computing, and emerging architectures. There is a compelling need for such research given the explosive growth in biological information, the complexity of interactions that underlie many biological processes, and the diversity and interconnectedness of organisms at the molecular level.

However, research in high-performance computational biology has not grown as rapidly as computational biology itself. There are subfields of computational biology that have not seen a significant influx of ideas from the high-performance computing community. This is perhaps a reflection of the confluence of expertise needed to conduct research in high-performance computational biology, which sets up a barrier to entry for new researchers.
Efforts spent in overcoming this barrier are worthwhile given the opportunities for high-impact research. By bringing together research in this area as a special section, we hope to provide a resource for IEEE Transactions on Parallel and Distributed Systems (TPDS) readers interested in this field and to aid the entry of new researchers into the field.

The arguments in favor of a sustained effort in high-performance computational biology are stronger than ever. New high-throughput sequencing machines introduced within the last year, such as those from 454 Life Sciences Inc., have significantly accelerated sequencing capabilities. Using 454 sequencing systems, it is possible to sequence as many as 200,000 short DNA fragments in a four-hour experiment for a few thousand dollars. These machines are increasingly being used to sample the transcriptomes of many organisms. The sequencing of several complex plant genomes is underway, starting with maize and sorghum. Similar to large-scale genome sequencing projects, comprehensive gene expression profiling projects are underway to conduct large-scale microarray experiments on an organism, spanning various organs, disease- or stress-induced states, and developmental stages. Forays into personalized medicine, rational drug design, and large-scale systems biology (such as the study of protein-protein interaction networks at the whole-organism level), as well as understanding evolutionary relationships and building the tree of life, all require processing vast amounts of data or carrying out highly complex computational tasks.

In this special section, we showcase some of the recent work in high-performance computational biology. In addition to the open call for papers, authors whose work was published in the 2005 IEEE International Workshop on High-Performance Computational Biology (HiCOMB, http://www.hicomb.org) were solicited to submit extended versions of their papers.
Each manuscript submitted to the special section was subjected to rigorous, independent peer review by three to four reviewers. We are extremely grateful to all the reviewers who agreed to provide, and delivered, thoughtful reviews within the time constraints imposed for the special section. Based on the reviewer suggestions and our own reading of the manuscripts, six manuscripts were selected for publication in the special section.

The first paper in this special section is on a scalable implementation of the widely used BLAST search program for homology detection between a query sequence and a database of known sequences. In "ScalaBLAST: A Scalable Implementation of BLAST for High-Performance Data-Intensive Bioinformatics Analysis," Christopher Oehmen and Jarek Nieplocha report on ScalaBLAST, a high-performance sequence alignment program they developed to enable applications that require thousands to millions of queries to be performed simultaneously. Such queries arise in applications such as multiple genome/proteome comparisons and in finding genes in newly sequenced genomes. By using a combination of techniques, including target database distribution, multilevel parallelism, parallel I/O, and latency hiding, the authors achieve a scalable implementation of this ubiquitous search program.

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 17, NO. 8, AUGUST 2006 737

Full Text