Abstract
This paper presents two R packages, ImbTreeEntropy and ImbTreeAUC, for handling imbalanced data problems. ImbTreeEntropy applies generalized entropy functions, such as Rényi, Tsallis, Sharma–Mittal, Sharma–Taneja and Kapur, to measure the impurity of a node. ImbTreeAUC provides non-standard measures for choosing the optimal split point of an attribute (as well as the optimal attribute for splitting) by employing local, semi-global and global AUC (Area Under the ROC curve) measures. Both packages are applicable to binary and multiclass problems, and they support cost-sensitive learning, via a user-defined misclassification cost matrix, as well as weight-sensitive learning. The packages accept all types of attributes, including continuous, ordered and nominal, where the latter type is handled in a simplified way for multiclass problems to reduce the computational overhead. Both packages can optimize the thresholds at which posterior probabilities are converted into final class labels so that misclassification costs are minimized. Model overfitting can be managed either during the growing phase or afterwards using post-pruning. The packages are implemented mainly in R, although some computationally demanding functions are written in plain C++. To speed up learning, parallel processing is supported as well.
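To make the generalized impurity measures concrete, below is a minimal R sketch of two of the entropies named above, following their standard textbook definitions. The function names are illustrative only and do not mirror the packages' internal (partly C++) implementation.

```r
# Minimal sketch of two generalized entropies used as node impurity measures.
# Standard definitions; illustrative only, not the packages' actual code.

renyi_entropy <- function(p, q = 2) {
  # p: class-probability vector of a node; q: order parameter.
  p <- p[p > 0]                                     # drop empty classes
  if (abs(q - 1) < 1e-12) return(-sum(p * log(p)))  # Shannon limit as q -> 1
  log(sum(p^q)) / (1 - q)
}

tsallis_entropy <- function(p, q = 2) {
  p <- p[p > 0]
  if (abs(q - 1) < 1e-12) return(-sum(p * log(p)))  # Shannon limit as q -> 1
  (1 - sum(p^q)) / (q - 1)
}

# An imbalanced node scores much lower impurity than a balanced one:
renyi_entropy(c(0.95, 0.05))   # ~0.10
renyi_entropy(c(0.50, 0.50))   # ~0.69
```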
Highlights
The problem of imbalanced data is one of the major challenges in machine learning;
We implement a large collection of generalized entropy functions, including Rényi, Tsallis, Sharma–Mittal, Sharma–Taneja and Kapur, as node impurity measures in the ImbTreeEntropy algorithm;
The results indicate that the ImbTreeEntropy and ImbTreeAUC algorithms are able to outperform other methods: they identify all 8 classes in the dataset while maintaining good accuracy, AUC and Kappa.
Summary
Imbalanced data is one of the major challenges in machine learning. A decision tree provides the final class label for each lowest-level partition (leaf), where each partition is greedily selected by choosing the best division from a set of possible splits through optimization of some impurity measure. In other words, the tree decides how to divide the classes between two consecutive nodes in a way that is optimal with respect to the impurity measure. The packages accept all types of attributes, including continuous, ordered and nominal. The novelty of both decision tree algorithms is tested on 10 benchmark datasets acquired from the UCI Machine Learning Repository [8]. The datasets represent binary and multiclass problems with continuous, ordinal or nominal attributes. The remainder of this paper is organized as follows: Section 2 provides an overview of similar research on decision tree learning with imbalanced datasets, as well as on the application of non-standard impurity measures.
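As an illustration of the greedy split search described above, the following simplified R sketch scans the candidate thresholds of a continuous attribute and keeps the one minimizing the weighted impurity of the two children. It illustrates the generic procedure only; the packages' actual search routines (including the AUC-based variants) are more involved, and `best_split` and `shannon` are hypothetical helper names.

```r
# Simplified greedy search for the best split point of a continuous attribute,
# scoring each candidate threshold by the weighted impurity of the children.
# Illustrative only; not the packages' actual search routine.

shannon <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }

best_split <- function(x, y, impurity = shannon) {
  thresholds <- head(sort(unique(x)), -1)  # every cut keeps both children non-empty
  best <- list(threshold = NA, score = Inf)
  for (t in thresholds) {
    left  <- y[x <= t]
    right <- y[x >  t]
    score <- (length(left)  * impurity(prop.table(table(left))) +
              length(right) * impurity(prop.table(table(right)))) / length(y)
    if (score < best$score) best <- list(threshold = t, score = score)
  }
  best
}

# Example with a built-in multiclass dataset; any of the generalized entropies
# sketched earlier can be passed via the `impurity` argument:
best_split(iris$Petal.Length, iris$Species)
```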