An AUC-based permutation variable importance measure for random forests

Silke Janitza, Carolin Strobl, Anne-Laure Boulesteix

doi:10.1186/1471-2105-14-119

Open Access

Journal: BMC Bioinformatics	Publication Date: Apr 5, 2013
Citations: 202	License type: CC BY 2.0

Affiliation: Zimmer Biomet (Germany), University of Zurich

Abstract

Background: The random forest (RF) method is a commonly used tool for classification with high dimensional data as well as for ranking candidate predictors based on the so-called random forest variable importance measures (VIMs). However the classification performance of RF is known to be suboptimal in case of strongly unbalanced data, i.e. data where response class sizes differ considerably. Suggestions were made to obtain better classification performance based either on sampling procedures or on cost sensitivity analyses. However to our knowledge the performance of the VIMs has not yet been examined in the case of unbalanced response classes. In this paper we explore the performance of the permutation VIM for unbalanced data settings and introduce an alternative permutation VIM based on the area under the curve (AUC) that is expected to be more robust towards class imbalance.

Results: We investigated the performance of the standard permutation VIM and of our novel AUC-based permutation VIM for different class imbalance levels using simulated data and real data. The results suggest that the new AUC-based permutation VIM outperforms the standard permutation VIM for unbalanced data settings while both permutation VIMs have equal performance for balanced data settings.

Conclusions: The standard permutation VIM loses its ability to discriminate between associated predictors and predictors not associated with the response for increasing class imbalance. It is outperformed by our new AUC-based permutation VIM for unbalanced data settings, while the performance of both VIMs is very similar in the case of balanced classes. The new AUC-based VIM is implemented in the R package party for the unbiased RF variant based on conditional inference trees. The codes implementing our study are available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/070_drittmittel/janitza/index.html.
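The two measures compared in the abstract can be written side by side. This is a sketch in assumed notation, which the text above does not itself give. For predictor j, ER_t and AUC_t denote the error rate and the AUC of tree t on its out-of-bag observations, the subscript with a tilde marks the same quantity after the values of predictor j have been randomly permuted, ntree is the number of trees, and ntree* is taken here to be the number of trees whose out-of-bag sample contains both classes (needed for the AUC to be defined).

```latex
% Error-rate-based permutation VIM: mean increase in out-of-bag error
% after permuting predictor j.
\mathrm{VI}_j^{(\mathrm{ER})}
  = \frac{1}{ntree} \sum_{t=1}^{ntree}
    \left( ER_{t\tilde{j}} - ER_{t} \right)

% AUC-based permutation VIM: mean decrease in out-of-bag AUC
% after permuting predictor j (assumed form; the sum runs over
% trees whose out-of-bag sample contains both classes).
\mathrm{VI}_j^{(\mathrm{AUC})}
  = \frac{1}{ntree^{*}} \sum_{t=1}^{ntree^{*}}
    \left( AUC_{t} - AUC_{t\tilde{j}} \right)
```

In both cases a large positive value indicates that permuting the predictor hurts prediction, and values near zero indicate no detectable association with the response.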

Highlights

The random forest (RF) method is a commonly used tool for classification with high dimensional data as well as for ranking candidate predictors based on the so-called random forest variable importance measures (VIMs).
Why may the error-rate-based permutation VIM fail in case of class imbalance? The prioritisation of the majority class in unbalanced data settings is well known in the context of RF classification and can be seen from trees constructed on unbalanced data.
How does this affect the performance of the permutation VIMs? And why is the area under the curve (AUC)-based permutation VIM expected to be more robust towards class imbalance than the commonly used error-rate-based permutation VIM?
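The questions above can be illustrated with a small toy example. The following is a minimal sketch, not the authors' implementation and not the `party` code: it builds no forest and uses no out-of-bag samples. A hypothetical fixed scoring function `toy_score` stands in for a fitted model's predicted minority-class probability, and the data (95 majority and 5 minority observations) are invented for illustration. Because the scores never cross the 0.5 threshold, every observation is assigned to the majority class, so permuting the predictor cannot change the error rate, while the AUC, which depends only on the ranking of the scores, still drops.

```python
import random


def auc(scores, labels):
    """AUC as the Mann-Whitney statistic: the share of (positive, negative)
    pairs in which the positive has the higher score; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined unless both classes are present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def error_rate(scores, labels, threshold=0.5):
    """Misclassification rate when class 1 is predicted for scores above threshold."""
    preds = [1 if s > threshold else 0 for s in scores]
    return sum(p != y for p, y in zip(preds, labels)) / len(labels)


def toy_score(x):
    """Hypothetical stand-in for a model's minority-class probability.
    For x in [0, 1] it stays at or below 0.35, so class 1 is never predicted."""
    return 0.05 + 0.3 * x


def permutation_importance(xs, labels, score_fn, metric, higher_is_better,
                           n_perm=20, seed=0):
    """Average change in `metric` after randomly permuting the predictor."""
    rng = random.Random(seed)
    base = metric([score_fn(x) for x in xs], labels)
    diffs = []
    for _ in range(n_perm):
        shuffled = xs[:]          # copy so the original order is kept
        rng.shuffle(shuffled)     # break the link between predictor and response
        perm = metric([score_fn(x) for x in shuffled], labels)
        diffs.append(base - perm if higher_is_better else perm - base)
    return sum(diffs) / n_perm


# Invented unbalanced data: 95 majority (class 0) and 5 minority (class 1)
# observations; the predictor is clearly higher in the minority class.
xs = [0.6 * i / 94 for i in range(95)] + [0.8 + 0.05 * i for i in range(5)]
labels = [0] * 95 + [1] * 5

err_vim = permutation_importance(xs, labels, toy_score, error_rate,
                                 higher_is_better=False)
auc_vim = permutation_importance(xs, labels, toy_score, auc,
                                 higher_is_better=True)

print("error-rate-based importance:", err_vim)
print("AUC-based importance:", auc_vim)
```

In this toy setting the error-rate-based importance is exactly zero although the predictor separates the classes perfectly, whereas the AUC-based importance is clearly positive. The real measures are computed per tree on out-of-bag observations and averaged over the forest; the sketch only isolates the thresholding effect.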

Summary

Introduction

The random forest (RF) method is a commonly used tool for classification with high dimensional data as well as for ranking candidate predictors based on the so-called random forest variable importance measures (VIMs). The classification performance of RF is known to be suboptimal in the case of strongly unbalanced data, i.e. data where response class sizes differ considerably. In this paper we explore the performance of the permutation VIM for unbalanced data settings and introduce an alternative permutation VIM based on the area under the curve (AUC) that is expected to be more robust towards class imbalance. In epidemiology, unbalanced data are observed, e.g., in population-based studies where only a small number of subjects develop a certain disease over time, while most subjects remain healthy. Studies on rare diseases are a further example of unbalanced data settings in medicine. Unbalanced data may arise whenever the class memberships are observed after data collection.

Objectives

Methods

Results

Conclusion