Comprehensive decision tree models in bioinformatics.

Gregor Stiglic, Igor Pernek, Simon Kocbek, Peter Kokol, Ahmed Moustafa

doi:10.1371/journal.pone.0033812

Abstract

Purpose: Classification is an important and widely used machine learning technique in bioinformatics. Researchers and other end-users of machine learning software often prefer to work with comprehensible models where knowledge extraction and explanation of the reasoning behind the classification model are possible.

Methods: This paper presents an extension to an existing machine learning environment and a study on visual tuning of decision tree classifiers. The motivation for this research comes from the need to build effective and easily interpretable decision tree models by a so-called one-button data mining approach where no parameter tuning is needed. To avoid bias in classification, no classification performance measure is used during tuning; the model is constrained exclusively by the dimensions of the produced decision tree.

Results: The proposed visual tuning of decision trees was evaluated on 40 datasets containing classical machine learning problems and 31 datasets from the field of bioinformatics. Although we did not expect significant differences in classification performance, the results demonstrate a significant increase in accuracy in less complex visually tuned decision trees. In contrast to classical machine learning benchmarking datasets, we observe higher accuracy gains in bioinformatics datasets. Additionally, a user study was carried out to confirm the assumption that the tree tuning times are significantly lower for the proposed method in comparison to manual tuning of the decision tree.

Conclusions: The empirical results demonstrate that by building simple models constrained by predefined visual boundaries, one not only achieves good comprehensibility, but also very good classification performance that does not differ from that of the usually more complex models built using default settings of the classical decision tree algorithm. In addition, our study demonstrates the suitability of visually tuned decision trees for datasets with binary class attributes and a high number of possibly redundant attributes, which are very common in bioinformatics.

Highlights

Decision trees are one of the most popular classification techniques in data mining [1].
One of the main reasons for this is decision trees’ ability to represent the results in a simple decision tree format which is easy to interpret for experts, as they can see the structure of decisions in the classifying process.
The basic idea of the decision tree format is to construct a tree whose leaves are labeled with a particular value for the class attribute and whose inner nodes represent descriptive attributes.

Summary

Introduction

Decision trees are one of the most popular classification techniques in data mining [1]. The basic idea of the decision tree format is to construct a tree whose leaves are labeled with a particular value for the class attribute and whose inner nodes represent descriptive attributes. To classify an instance, one follows the path from the root to a leaf according to the instance's attribute values; the class value assigned is the one labeling that leaf. Following this process, one can extract classification rules that can readily be expressed so that humans can understand them. Beyond this simplicity, building a decision tree is often less time-consuming than training other classification techniques [2], and decision tree rules can be directly used as statements in a database access language (e.g. SQL).
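The tree format described above can be illustrated with a minimal Python sketch. It is not the implementation used in the study: the class names (`Leaf`, `Node`), the toy attributes (`gene_a`, `gene_b`), the class labels and the table name `samples` are all hypothetical and chosen only for illustration. The sketch shows how an instance is classified by walking from the root to a leaf, and how each root-to-leaf path can be written as a rule in SQL.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class Leaf:
    label: str  # class value assigned to instances that reach this leaf


@dataclass
class Node:
    attribute: str              # descriptive attribute tested at this inner node
    children: Dict[str, "Tree"]  # one subtree per attribute value


Tree = Union[Leaf, Node]
Rule = Tuple[List[Tuple[str, str]], str]  # (conditions, class label)


def classify(tree: Tree, instance: Dict[str, str]) -> str:
    """Follow the path selected by the instance's attribute values to a leaf."""
    while isinstance(tree, Node):
        tree = tree.children[instance[tree.attribute]]
    return tree.label


def extract_rules(tree: Tree, conditions: tuple = ()) -> List[Rule]:
    """Turn every root-to-leaf path into one (conditions, label) rule."""
    if isinstance(tree, Leaf):
        return [(list(conditions), tree.label)]
    rules: List[Rule] = []
    for value, child in tree.children.items():
        rules.extend(extract_rules(child, conditions + ((tree.attribute, value),)))
    return rules


def rule_to_sql(conditions: List[Tuple[str, str]], label: str,
                table: str = "samples") -> str:
    """Render a rule as a SQL query; the table name is a hypothetical placeholder."""
    where = " AND ".join(f"{attr} = '{val}'" for attr, val in conditions) or "1=1"
    return f"SELECT * FROM {table} WHERE {where}; -- class: {label}"


# Hypothetical toy tree over two discretized expression attributes.
tree = Node("gene_a", {
    "high": Node("gene_b", {"high": Leaf("tumor"), "low": Leaf("normal")}),
    "low": Leaf("normal"),
})

rules = extract_rules(tree)

print(classify(tree, {"gene_a": "high", "gene_b": "low"}))
for conditions, label in rules:
    print(rule_to_sql(conditions, label))
```

Each leaf yields exactly one rule, so a tree with fewer leaves yields a shorter rule set. This is one way to see why constraining the dimensions of the tree also keeps the extracted rules easy to read.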

Methods

Results

Discussion

Conclusion