Abstract
This paper presents two R packages, ImbTreeEntropy and ImbTreeAUC, for handling imbalanced data problems. ImbTreeEntropy applies generalized entropy functions, such as Rényi, Tsallis, Sharma–Mittal, Sharma–Taneja and Kapur, to measure the impurity of a node. ImbTreeAUC provides non-standard measures for choosing the optimal split point of an attribute (as well as the optimal attribute for splitting) by employing local, semi-global and global AUC (Area Under the ROC curve) measures. Both packages are applicable to binary and multiclass problems, and they support cost-sensitive learning, via a user-defined misclassification cost matrix, as well as weight-sensitive learning. The packages accept all types of attributes, including continuous, ordered and nominal, where the latter type is handled in a simplified way for multiclass problems to reduce the computational overhead. Both packages can optimize the thresholds at which posterior probabilities are converted into final class labels so that misclassification costs are minimized. Model overfitting can be managed either during the growing phase or afterwards using post-pruning. The packages are implemented mainly in R, although some computationally demanding functions are written in plain C++. To speed up learning, parallel processing is supported as well.
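To make the generalized impurity measures concrete, below is a minimal R sketch of two of the entropies named above, following their standard textbook definitions. The function names are illustrative only and do not mirror the packages' internal (partly C++) implementation.

```r
# Minimal sketch of two generalized entropies used as node impurity measures.
# Standard definitions; illustrative only, not the packages' actual code.

renyi_entropy <- function(p, q = 2) {
  # p: class-probability vector of a node; q: order parameter.
  p <- p[p > 0]                                     # drop empty classes
  if (abs(q - 1) < 1e-12) return(-sum(p * log(p)))  # Shannon limit as q -> 1
  log(sum(p^q)) / (1 - q)
}

tsallis_entropy <- function(p, q = 2) {
  p <- p[p > 0]
  if (abs(q - 1) < 1e-12) return(-sum(p * log(p)))  # Shannon limit as q -> 1
  (1 - sum(p^q)) / (q - 1)
}

# An imbalanced node scores much lower impurity than a balanced one:
renyi_entropy(c(0.95, 0.05))   # ~0.10
renyi_entropy(c(0.50, 0.50))   # ~0.69
```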
Highlights
The problem of imbalanced data is one of the major challenges in machine learning;
We implement a large collection of generalized entropy functions, including Rényi, Tsallis, Sharma–Mittal, Sharma–Taneja and Kapur, as node impurity measures in the ImbTreeEntropy algorithm;
The results indicate that the ImbTreeEntropy and ImbTreeAUC algorithms are able to outperform other methods: they identify all 8 classes in the dataset while maintaining good accuracy, AUC and Kappa.
Summary
Imbalanced data is one of the major challenges in machine learning. A decision tree provides the final class label for each lowest-level partition (leaf), where each partition is greedily selected by choosing the best division from a set of possible splits through optimization of some impurity measure. In other words, the tree decides how to divide the classes between two consecutive nodes in a way that is optimal with respect to the impurity measure. The packages accept all types of attributes, including continuous, ordered and nominal. The novelty of both decision tree algorithms is tested on 10 benchmark datasets acquired from the UCI Machine Learning Repository [8]. The datasets represent binary and multiclass problems with continuous, ordinal or nominal attributes. The remainder of this paper is organized as follows: Section 2 provides an overview of similar research on decision tree learning with imbalanced datasets, as well as on the application of non-standard impurity measures.
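As an illustration of the greedy split search described above, the following simplified R sketch scans the candidate thresholds of a continuous attribute and keeps the one minimizing the weighted impurity of the two children. It illustrates the generic procedure only; the packages' actual search routines (including the AUC-based variants) are more involved, and `best_split` and `shannon` are hypothetical helper names.

```r
# Simplified greedy search for the best split point of a continuous attribute,
# scoring each candidate threshold by the weighted impurity of the children.
# Illustrative only; not the packages' actual search routine.

shannon <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }

best_split <- function(x, y, impurity = shannon) {
  thresholds <- head(sort(unique(x)), -1)  # every cut keeps both children non-empty
  best <- list(threshold = NA, score = Inf)
  for (t in thresholds) {
    left  <- y[x <= t]
    right <- y[x >  t]
    score <- (length(left)  * impurity(prop.table(table(left))) +
              length(right) * impurity(prop.table(table(right)))) / length(y)
    if (score < best$score) best <- list(threshold = t, score = score)
  }
  best
}

# Example with a built-in multiclass dataset; any of the generalized entropies
# sketched earlier can be passed via the `impurity` argument:
best_split(iris$Petal.Length, iris$Species)
```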