Abstract
Decision trees are popular as stand-alone classifiers or as base learners in ensemble classifiers, mostly because decision trees have the advantage of being easy to explain. To improve the classification performance of decision trees, some authors have used Multivariate Decision Trees (MDTs), which allow combinations of features when splitting a node. While there is growing interest in the area, recent research on MDTs shares a common shortcoming: it does not provide an adequate comparison with related work, either because relevant rival techniques are not considered or because algorithm performance is tested on an insufficient number of databases. As a result, claims lack statistical support and, hence, there is a lack of general understanding of the actual capabilities of existing MDT induction algorithms, which is crucial to improving the state of the art. In this paper, we report on an exhaustive review of MDTs. In particular, we give an overview of 37 MDT induction algorithms, of which we have experimentally compared 19 on 57 databases. We provide a statistical comparison on all databases and on subsets of databases grouped according to the number of classes, number of features, number of instances, and degree of class imbalance. This allows us to identify groups of top-performing algorithms for different types of databases.
Highlights
Decision trees (DTs) are popular classifiers, partly because their models are easy to explain and because they show remarkable performance
There is no comprehensive comparison that determines the relative performance of existing Multivariate Decision Trees (MDTs), let alone identifies the top ones. This is both because there are no surveys about MDTs and because recent papers introducing MDTs suffer from one or two main shortcomings in their comparison with previous work: authors do not compare their algorithm with relevant rival techniques, or they do so on too few databases, so the results are insufficient to statistically validate the underlying hypothesis
Our goal with this paper is to fill this gap; that is, we aim to evaluate the relative merit of MDT induction algorithms and identify how they compare to one another
Summary
Decision trees (DTs) are popular classifiers, partly because their models are easy to explain and because they show remarkable performance. Decision tree performance is highly competitive through the use of ensembles; in a recent survey [2], Random Forest [3] and eXtreme Gradient Boosting (XGBoost) [4] are among the top-ranked algorithms. In a decision tree, each branch is tagged with a test, which evaluates to true or false for each object. For branches coming out of the same node, the tests define a partition of the database, so, for each object, one and only one of the tests evaluates to true. The tuple of tests tagging the branches from a node is known as a split because it is used to split the objects in a node into disjoint subsets during tree construction. We also use split as a verb: to split a node is to select a split and generate the corresponding child nodes
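As a minimal illustration of the distinction the paper builds on (this sketch is ours, not from the paper, and the function names and toy data are hypothetical), a univariate split tests a single feature against a threshold, whereas a multivariate (oblique) split tests a linear combination of features. Either way, the split partitions the objects in a node into disjoint subsets:

```python
import numpy as np

def univariate_split(X, feature, threshold):
    """Axis-parallel test: x[feature] <= threshold."""
    mask = X[:, feature] <= threshold
    return X[mask], X[~mask]  # each object lands in exactly one child

def multivariate_split(X, weights, threshold):
    """Oblique test: w . x <= threshold, a linear combination of features."""
    mask = X @ weights <= threshold
    return X[mask], X[~mask]

# Toy database: 5 objects, 2 features (values chosen for illustration only).
X = np.array([[1.0, 2.0],
              [3.0, 1.0],
              [0.5, 4.0],
              [2.5, 2.5],
              [4.0, 0.5]])

left, right = univariate_split(X, feature=0, threshold=2.0)
assert len(left) + len(right) == len(X)  # the two subsets partition X

left, right = multivariate_split(X, weights=np.array([0.7, -0.3]), threshold=1.0)
assert len(left) + len(right) == len(X)
```

MDT induction algorithms differ mainly in how they search for the weight vector and threshold of such oblique tests; the sketch above only shows how a given split partitions a node's objects.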