Guest Editors' Introduction: Special Issue on Mining Biological Data

Wei Wang, Jiong Yang

doi:10.1109/tkde.2005.128

Abstract

Mining biological data is an emerging area at the intersection of data mining and bioinformatics. Bioinformaticians have been working on the research and development of computational methodologies and tools for expanding the use of biological, medical, behavioral, or health-related data. Data mining researchers have been making substantial contributions to the development of models and algorithms to meet the challenges posed by bioinformatics research. Successful examples include frequent pattern discovery on biological molecules, text mining in biomedical literature, information integration, and probabilistic modeling of genome sequences.

This special issue of the IEEE Transactions on Knowledge and Data Engineering features a collection of 11 papers, selected from 54 submissions, representing recent advances at the frontier of mining biological data.

Mining frequent trees is very useful in bioinformatics applications. The first paper, “Efficiently Mining Frequent Trees in a Forest: Algorithms and Applications” by Mohammed J. Zaki, formulates the problem of mining (embedded) subtrees in a forest of rooted, labeled, and ordered trees. The author presents TreeMiner, a novel algorithm that discovers all frequent subtrees in a forest using a new data structure called a scope-list. TreeMiner has been proven to be superior to previous methods such as PatternMatcher and has shown promising results in analyzing RNA structure and phylogenetics data sets.

In the second paper, “Frequent Substructure-Based Approaches for Classifying Chemical Compounds” by Mukund Deshpande, Michihiro Kuramochi, Nikil Wale, and George Karypis, the authors devise a substructure-based classification algorithm that decouples the substructure discovery process from the classification model construction and uses frequent subgraph discovery algorithms to find all topological and geometric substructures present in the data set. The advantage of this approach is that, during classification model construction, all relevant substructures are available, which allows the classifier to intelligently select the most discriminating ones. This approach is employed to build models that correctly assign chemical compounds to various classes of interest. Such models have many applications in pharmaceutical research and are used extensively at various phases of the drug development process.

Glycans, or carbohydrate sugar chains, are regarded as the third class of biological molecules, after DNA and proteins, and the recent advent of glycome informatics has generated an increasing number of glycan structures and annotation data. Glycans play important roles in the development and functioning of multicellular organisms, and their structures can be represented by labeled ordered trees. The third paper, “A Probabilistic Model for Mining Labeled Ordered Trees: Capturing Patterns in Carbohydrate Sugar Chains” by Nobuhisa Ueda, Kiyoko F. Aoki-Kinoshita, Atsuko Yamaguchi, Tatsuya Akutsu, and Hiroshi Mamitsuka, proposes a probabilistic model for mining labeled ordered trees and an EM algorithm for efficient learning.

Proteins are the machinery of life. A number of techniques have been developed to classify proteins according to important features in their sequences, secondary structures, or three-dimensional structures. The fourth paper, “Finding Patterns on Protein Surfaces: Algorithms and Applications to Protein Classification” by Xiong Wang, introduces a novel approach to protein classification based on significant geometric patterns on the surface of a protein.

The binding in protein-protein interactions exhibits a kind of biochemical stability in cells, which can be described by the mathematical notion of fixed points. In the fifth paper, “Using Fixed Point Theorems to Model the Binding in Protein-Protein Interactions” by Jinyan Li and Haiquan Li, the authors define a point as a protein motif pair consisting of two traditional protein motifs. They propose a method to discover stable motif pairs of a given function from a large protein interaction sequence data set.

With the rapid growth of articles on genomics research, it has become a challenge for biomedical researchers to access this ever-increasing quantity of information and keep up with the newest discoveries about the functions of the proteins they are studying. Text mining techniques thus become essential for facilitating the functional annotation of proteins: they draw on the huge amounts of biomedical literature and transform the knowledge into easily accessible database formats. The sixth paper, “Literature Extraction of Protein Functions Using Sentence Pattern Mining” by Jung-Hsien Chiang and Hsu-Chun Yu, proposes sentence pattern mining as a method to extract protein functions from biomedical literature.

Identifying concepts that have already been patented is essential for undertaking new biomedical research. Traditional keyword-based search on patent databases may not be sufficient to retrieve all the relevant information, especially in the biomedical domain. The seventh paper, “Information Retrieval and Knowledge Discovery Utilizing a BioMedical Patent Semantic Web” by Sougata Mukherjea, Bhuvan Bamba, and Pankaj Kankar, presents BioPatentMiner, a system that facilitates information retrieval and knowledge discovery from biomedical patents. BioPatentMiner first identifies biological terms

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 17, NO. 8, AUGUST 2005 1019
