In this paper, we describe selective sampling algorithms for decision trees and random forests and their contribution to classification accuracy. In our selective sampling algorithms, the instance that yields the highest expected utility is chosen to be labeled by the expert. We show that the most valuable unlabeled instance to be labeled by the expert and added to the training dataset of the decision tree can be identified simply by examining the influence of this new instance on the class probabilities of the leaves. All unlabeled instances that fall into the same leaf have the same class probabilities. As a result, we can compute the expected accuracy of the decision tree over its leaves instead of over each individual unlabeled instance. An extension for random forests is also presented. Moreover, we show that the selective sampling classifier has to belong to the same family as the classifier whose accuracy we wish to improve but need not be identical to it. For example, a random forest classifier can be used for the selective sampling process, and the results can be used to improve the classification accuracy of a decision tree. Likewise, a random forest classifier consisting of three trees can be used in the selective sampling algorithm to improve the classification accuracy of a random forest consisting of ten trees. Our experiments show that the proposed selective sampling algorithms achieve better accuracy than standard random sampling, uncertainty sampling, and the active belief decision tree learning approach (ABC4.5) on several real-world datasets. We also show that our selective sampling algorithms significantly improve the classification performance of several state-of-the-art classifiers, such as the random rotation forest classifier, on real-world large-scale datasets.
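To make the leaf-level computation concrete, the following is a minimal sketch, assuming a scikit-learn DecisionTreeClassifier. Because all unlabeled instances falling into the same leaf share one class-probability vector, the utility of a candidate query can be scored once per leaf rather than once per instance. The proxy utility used here (leaf uncertainty weighted by leaf coverage) is an illustrative assumption, not the paper's exact expected-accuracy criterion; the function name query_instance is likewise hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def query_instance(tree: DecisionTreeClassifier, X_unlabeled: np.ndarray) -> int:
    """Return the index of the unlabeled instance to send to the expert.

    Sketch only: scores each leaf once, since every unlabeled instance
    in a leaf shares the same class probabilities.
    """
    leaf_ids = tree.apply(X_unlabeled)       # leaf index for each instance
    proba = tree.predict_proba(X_unlabeled)  # identical within a leaf
    best_leaf, best_score = None, -np.inf
    for leaf in np.unique(leaf_ids):
        members = np.where(leaf_ids == leaf)[0]
        p = proba[members[0]]                # one probability vector per leaf
        # Proxy utility (an assumption): class-probability uncertainty of
        # the leaf, weighted by how many unlabeled instances it covers.
        score = (1.0 - p.max()) * len(members)
        if score > best_score:
            best_leaf, best_score = leaf, score
    # Any instance in the chosen leaf is equivalent under this criterion.
    return int(np.where(leaf_ids == best_leaf)[0][0])
```

Under this per-leaf scheme, the cost of one query round grows with the number of leaves rather than the number of unlabeled instances, which is what makes the approach attractive for large-scale datasets.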