Evolutionary Design of Decision-Tree Algorithms Tailored to Microarray Gene Expression Data Sets

Rodrigo C Barros, Alex A Freitas, Andre C P L F De Carvalho, Marcio P Basgalupp

doi:10.1109/tevc.2013.2291813

Abstract

Decision-tree induction algorithms are widely used in machine learning applications in which the goal is to extract knowledge from data and present it in a graphically intuitive way. The most successful strategy for inducing decision trees is the greedy top-down recursive approach, which has been continuously improved by researchers over the past 40 years. In this paper, we propose a paradigm shift in decision-tree research: instead of proposing a new manually designed method for inducing decision trees, we propose automatically designing decision-tree induction algorithms tailored to a specific type of classification data set (or application domain). Following recent breakthroughs in the automatic design of machine learning algorithms, we propose a hyper-heuristic evolutionary algorithm, called hyper-heuristic evolutionary algorithm for designing decision-tree algorithms (HEAD-DT), that evolves design components of top-down decision-tree induction algorithms. By the end of the evolution, we expect HEAD-DT to generate a new and possibly better decision-tree algorithm for a given application domain. We perform extensive experiments on 35 real-world microarray gene expression data sets to assess the performance of HEAD-DT, and compare it with well-known decision-tree algorithms such as C4.5, CART, and REPTree. Results show that HEAD-DT is capable of generating algorithms that significantly outperform the baseline manually designed decision-tree algorithms regarding predictive accuracy and F-measure.

Highlights

Classification is a machine learning task that aims at building class distribution models by taking into account a set of instances characterized by predictive attributes
Given the class imbalance (the difference in the relative frequency of the most frequent and the least frequent classes in the data set), it seems fair to say that the hyper-heuristic evolutionary algorithm for designing decision-tree algorithms (HEAD-DT) outperformed the baseline methods in configuration {7 x 14}, given that it generated algorithms whose F-measure values are better than those achieved by classification and regression trees (CART), C4.5, and REPTree
We report here a comparison of the HEAD-DT results with support vector machines (SVM) results in order to give an idea about the level of predictive performance that could be obtained for the gene expression data sets used in our experiments
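
The highlights refer to two quantities: an imbalance measure (the difference in relative frequency between the most frequent and the least frequent classes) and the F-measure. The following is a minimal Python sketch of both, not the authors' code. The function names `imbalance_difference` and `f_measure` are hypothetical, and the F-measure shown is the standard per-class harmonic mean of precision and recall; how the paper averages it across classes is not stated in this summary.

```python
from collections import Counter


def imbalance_difference(labels):
    """Relative frequency of the most frequent class minus that of the
    least frequent class (hypothetical helper name)."""
    counts = Counter(labels)
    total = len(labels)
    freqs = [count / total for count in counts.values()]
    return max(freqs) - min(freqs)


def f_measure(y_true, y_pred, positive):
    """Per-class F-measure: harmonic mean of precision and recall for the
    class `positive`. Returns 0.0 when there are no true positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative data only, not taken from the paper.
example_labels = ["a"] * 8 + ["b"] * 2
example_imbalance = imbalance_difference(example_labels)
example_f = f_measure([1, 1, 1, 0, 0], [1, 1, 0, 1, 0], positive=1)
```

On a heavily imbalanced data set, accuracy can look high for a classifier that mostly predicts the majority class, which is one reason to report the F-measure alongside it.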

Summary

Introduction

Classification is a machine learning task that aims at building class distribution models by taking into account a set of instances characterized by predictive attributes. The outcome of such a model is used for assigning class labels to new instances that are described only by the values of their predictive attributes. Classification makes use of the training set to induce a model, an abstract knowledge representation (induction step), which is, in turn, employed for classifying instances whose class information is unknown (deduction step).

Manuscript received January 8, 2013; revised September 13, 2013; accepted November 12, 2013. Date of publication November 20, 2013; date of current version November 26, 2014.
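
To make the two steps of classification concrete, here is a minimal Python sketch of greedy top-down recursive decision-tree induction (the induction step) followed by classification of unseen instances (the deduction step). It is a generic illustration, not HEAD-DT and not any of the baseline algorithms. The entropy-based split criterion, the `max_depth` stopping rule, the function names, and the toy data are all assumptions chosen for brevity; they stand in for the kind of design components that an approach like HEAD-DT would vary.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (gain, feature_index, threshold) for the binary split with the
    highest information gain, or None if no valid split exists."""
    base = entropy(labels)
    best = None
    for j in range(len(rows[0])):
        for threshold in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[j] > threshold]
            if not left or not right:
                continue
            weighted = (len(left) * entropy(left)
                        + len(right) * entropy(right)) / len(labels)
            gain = base - weighted
            if best is None or gain > best[0]:
                best = (gain, j, threshold)
    return best


def induce(rows, labels, max_depth=3):
    """Induction step: greedy top-down recursive partitioning.
    Leaves are class labels (must not be tuples); internal nodes are
    (feature_index, threshold, left_subtree, right_subtree) tuples."""
    majority = Counter(labels).most_common(1)[0][0]
    split = None
    if max_depth > 0 and len(set(labels)) > 1:
        split = best_split(rows, labels)
    if split is None or split[0] <= 0:
        return majority  # stop: pure node, depth limit, or no useful split
    _, j, threshold = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= threshold]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > threshold]
    return (j, threshold,
            induce([r for r, _ in left], [y for _, y in left], max_depth - 1),
            induce([r for r, _ in right], [y for _, y in right], max_depth - 1))


def classify(tree, row):
    """Deduction step: route an unlabeled instance from the root to a leaf."""
    while isinstance(tree, tuple):
        j, threshold, left, right = tree
        tree = left if row[j] <= threshold else right
    return tree


# Toy training set: two numeric attributes per instance (illustrative only).
train_rows = [[1.0, 5.0], [2.0, 4.0], [8.0, 5.0], [9.0, 4.0]]
train_labels = ["healthy", "healthy", "tumor", "tumor"]
tree = induce(train_rows, train_labels)
predictions = [classify(tree, [1.5, 4.5]), classify(tree, [8.5, 4.5])]
```

Swapping `entropy` for another impurity measure, or replacing the depth limit with a different stopping rule, changes the induced algorithm without touching the recursive skeleton.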

Methods

Results

Conclusion