Abstract

Ensemble models refer to methods that combine a typically large number of weak learners into a stronger composite model. The output of an ensemble method results from fitting a base learning algorithm to a given data set and obtaining diverse answers by re-weighting the observations or by re-sampling them according to a given probabilistic selection scheme. A key challenge of using ensembles on large-scale, multidimensional data lies in the complexity and the computational burden associated with them. The models created by ensembles are often difficult, if not impossible, to interpret, and their implementation requires more computational power than individual learning algorithms. Recent research effort in the field has concentrated on reducing ensemble size while maintaining predictive accuracy. We propose a method to prune an ensemble solution by optimizing its margin distribution while increasing its diversity. The proposed algorithm results in an ensemble that uses only a fraction of the original weak learners, with generally improved estimated generalization performance. We analyze and test our method on both synthetic and real data sets. The analysis shows that the proposed method compares favorably to the original ensemble solutions and to other existing ensemble pruning methodologies.
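
For context, the sketch below fits a standard boosting ensemble with scikit-learn and computes the margin of every training observation under the ensemble's voting weights. The data set, the number of weak learners, and the percentile that is printed are illustrative choices and are not taken from the paper.

```python
# Illustrative only: a boosting ensemble and its margin distribution.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
ens = AdaBoostClassifier(n_estimators=200, random_state=0).fit(X, y)

# Per-learner predictions mapped to {-1, +1}; rows = weak learners, columns = samples.
H = np.array([2 * h.predict(X) - 1 for h in ens.estimators_])
w = ens.estimator_weights_[: len(ens.estimators_)]   # guard against early stopping
w = w / w.sum()                                      # normalized voting weights
margins = (2 * y - 1) * (w @ H)                      # per-observation margin in [-1, 1]

print("5th percentile of the margin distribution:", np.percentile(margins, 5))
```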

Highlights

  • Ensemble methods combine a large number of fitted values (sometimes in the hundreds) into a composite prediction

  • Although ensembles generally perform strongly in terms of generalization ability compared to individual classifiers, their application to large-scale, high-velocity data sets creates challenges given the more complex nature of these learning algorithms


Summary

INTRODUCTION

Ensemble methods combine a large number of fitted values (sometimes in the hundreds) into a composite prediction. The term boosting refers to a family of methods that combine weak learners (classification algorithms that perform at least slightly better than random) into a strong-performing ensemble through weighted voting. Interpretations of ensemble predictions are not as straightforward as those of single learning algorithms, and the implementation of the resulting models requires fitting the data through all of the iterations (sometimes in the hundreds) of the ensemble. A high number of iterations is oftentimes necessary to reap the benefits of the improved generalization performance provided by ensembles [5], [7]. For this reason, recent research effort has concentrated on reducing ensemble size, a task called ensemble pruning (thinning), while trying to maintain or improve predictive accuracy (see, e.g., [18]–[27]). In this article we propose an algorithm that produces a reduced, strong-performing sub-ensemble by optimizing the diversity of the weak learners and maximizing the lower tail of its margin distribution. The proposed method is a weight-based quadratic optimization formulation that aims to tune the weights of a given ensemble such that the pairwise correlations of the weak learners and the margin variance are minimized, while the lower percentiles of the margin distribution of the ensemble are maximized.
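
The exact formulation appears later in the paper; as a rough, hedged illustration of the idea described above, the following sketch solves a small quadratic program over the ensemble weights. The choice of objective (margin variance plus a linear reward on the lowest-margin observations), the `lambda_` trade-off, the `lower_frac` proxy for the lower percentiles, and the thresholding rule used to drop learners are assumptions made for this example, not the authors' QMM algorithm.

```python
# Hedged sketch of a weight-based quadratic pruning step; the objective and
# pruning rule below are illustrative assumptions, not the paper's QMM method.
import numpy as np
from scipy.optimize import minimize

def prune_by_margin_qp(H, y, lambda_=1.0, lower_frac=0.1, keep_tol=1e-3):
    """H: (T, n) weak-learner predictions in {-1, +1}; y: (n,) labels in {-1, +1}.
    Returns the indices of the retained learners and their renormalized weights."""
    T, n = H.shape
    M = H * y                        # per-learner margins: +1 if correct, -1 if wrong
    Q = np.cov(M)                    # w @ Q @ w = variance of the ensemble margin;
                                     # off-diagonal terms penalize correlated learners
    base = M.mean(axis=0)            # equal-weight ensemble margin of each observation
    hard = np.argsort(base)[: max(1, int(lower_frac * n))]
    c = M[:, hard].mean(axis=1)      # reward learners that lift the lowest margins

    def objective(w):
        return w @ Q @ w - lambda_ * (c @ w)

    w0 = np.full(T, 1.0 / T)
    res = minimize(objective, w0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * T,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
    keep = np.flatnonzero(res.x > keep_tol)   # prune learners with negligible weight
    return keep, res.x[keep] / res.x[keep].sum()
```

With the toy ensemble from the earlier sketch, calling `prune_by_margin_qp(H, 2 * y - 1)` would return a sub-ensemble whose retained learners and weights can then be compared against the full ensemble.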

PRELIMINARIES
BOOSTING ALGORITHMS
DIVERSITY AND ENSEMBLE PERFORMANCE
ENSEMBLES UNDER NOISE
SELECTION-BASED METHODS
PROPOSED PRUNING ALGORITHM
EXPERIMENTS AND SIMULATIONS
QMM PERFORMANCE ON BENCHMARK DATA SETS
Findings
CONCLUSIONS