Abstract

Background

It is commonly believed that including domain knowledge in a prediction model is desirable. However, representing and incorporating domain information in the learning process is, in general, a challenging problem. In this research, we consider domain information encoded by discrete or categorical attributes. A discrete or categorical attribute provides a natural partition of the problem domain, and hence divides the original problem into several non-overlapping sub-problems. In this sense, the domain information is useful if the partition simplifies the learning task. The goal of this research is to develop an algorithm to identify discrete or categorical attributes that maximally simplify the learning task.

Results

We consider restructuring a supervised learning problem via a partition of the problem space using a discrete or categorical attribute. A naive approach exhaustively searches all the possible restructured problems. It is computationally prohibitive when the number of discrete or categorical attributes is large. We propose a metric to rank attributes according to their potential to reduce the uncertainty of a classification task. It is quantified as a conditional entropy achieved using a set of optimal classifiers, each of which is built for a sub-problem defined by the attribute under consideration. To avoid high computational cost, we approximate the solution by the expected minimum conditional entropy with respect to random projections. This approach is tested on three artificial data sets, three cheminformatics data sets, and two leukemia gene expression data sets. Empirical results demonstrate that our method is capable of selecting a proper discrete or categorical attribute to simplify the problem, i.e., the performance of the classifier built for the restructured problem consistently beats that of the original problem.

Conclusions

The proposed conditional entropy based metric is effective in identifying good partitions of a classification problem, hence enhancing the prediction performance.
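As a concrete illustration of the restructuring idea, the sketch below partitions a data set by a categorical attribute, fits one classifier per sub-problem, and scores the attribute by the weighted conditional entropy remaining in the classifiers' predictions (lower is better). This is a minimal, illustrative reading of the abstract, not the paper's implementation: it uses scikit-learn logistic regression as a stand-in for the optimal classifiers, evaluates entropy on the training data, and omits the random-projection approximation the authors use to keep the computation tractable. The function names conditional_entropy and score_attribute are hypothetical.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def conditional_entropy(y_true, y_pred):
        # H(Y | Y_hat): entropy of the true label within each predicted-label
        # bucket, weighted by bucket size. Labels are non-negative integers.
        h = 0.0
        for v in np.unique(y_pred):
            mask = y_pred == v
            p = np.bincount(y_true[mask]) / mask.sum()
            p = p[p > 0]
            h += (mask.sum() / len(y_true)) * -(p * np.log2(p)).sum()
        return h

    def score_attribute(X, y, a):
        # Weighted conditional entropy achieved by per-partition classifiers;
        # `a` holds the categorical attribute value of each sample.
        score = 0.0
        for v in np.unique(a):
            mask = a == v
            if len(np.unique(y[mask])) < 2:  # pure partition: zero entropy
                continue
            clf = LogisticRegression(max_iter=1000).fit(X[mask], y[mask])
            h = conditional_entropy(y[mask], clf.predict(X[mask]))
            score += (mask.sum() / len(y)) * h
        return score  # rank attributes ascending: lower entropy, simpler task

Under this reading, ranking all candidate attributes amounts to calling score_attribute once per attribute and keeping the minimizer, in place of the exhaustive search over restructured problems that the abstract describes as computationally prohibitive.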

Highlights

  • It is commonly believed that including domain knowledge in a prediction model is desirable

  • To select the proper discrete/categorical attribute to maximally simplify a classification problem, we propose an attribute selection metric based on conditional entropy achieved by a set of optimal classifiers built for the restructured problem space

  • Our approach is fundamentally different from the decision tree approach [28]: first, the tree-like restructuring breaks the problem into multiple, more solvable sub-problems rather than making prediction decisions; second, the splitting criterion we propose is based on the conditional entropy achieved by a categorical attribute together with a hypothesis class, whereas the conditional entropy in decision trees is achieved by an attribute alone (see the toy computation after this list)
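To make the second difference concrete, the toy computation below evaluates the decision-tree quantity H(Y | A), which conditions on the attribute alone. The data and numbers are made up for illustration; the proposed metric would additionally fit a classifier on the feature vectors within each partition and condition on its predictions.

    import numpy as np

    def entropy(p):
        # Shannon entropy, in bits, of a probability vector.
        p = p[p > 0]
        return -(p * np.log2(p)).sum()

    a = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # categorical attribute A
    y = np.array([0, 1, 0, 1, 1, 1, 0, 0])  # class label Y

    # Decision-tree style criterion: H(Y | A), the attribute alone.
    h_y_given_a = sum(
        (a == v).mean() * entropy(np.bincount(y[a == v]) / (a == v).sum())
        for v in np.unique(a))
    print(h_y_given_a)  # 1.0 bit: A alone removes no uncertainty here,
    # yet per-partition classifiers on the features could still drive the
    # remaining entropy toward zero, which is what the proposed metric scores.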

Summary

Introduction

It is commonly believed that including domain knowledge in a prediction model is desirable. A discrete or categorical attribute provides a natural partition of the problem domain, and divides the original problem into several non-overlapping sub-problems. In this sense, the domain information is useful if the partition simplifies the learning task. The distance between the learned model and the target function is often quantified as the generalization error, which can be divided into an approximation term and an estimation term. The former is determined by the capacity of the hypothesis class, while the latter is related to the size of the training sample; restricting the hypothesis class using domain knowledge is one way to control this trade-off. A brief synopsis of the main findings most related to this article serves to provide a rationale for incorporating domain information in supervised learning.
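For readers unfamiliar with this decomposition, the standard statement from statistical learning theory (our notation, not the paper's) is:

    \[
    \underbrace{R(\hat h) - R(f^{*})}_{\text{generalization error}}
    \;=\;
    \underbrace{\Big(\inf_{h \in \mathcal{H}} R(h) - R(f^{*})\Big)}_{\text{approximation}}
    \;+\;
    \underbrace{\Big(R(\hat h) - \inf_{h \in \mathcal{H}} R(h)\Big)}_{\text{estimation}}
    \]

where \(R\) denotes the risk, \(f^{*}\) the target function, \(\mathcal{H}\) the hypothesis class, and \(\hat h\) the learned model. Restricting \(\mathcal{H}\) with domain knowledge shrinks the estimation term, which grows with the capacity of \(\mathcal{H}\) relative to the sample size, at the risk of inflating the approximation term.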

