Abstract

What is the value of an individual model in an ensemble of binary classifiers? We answer this question by introducing a class of transferable utility cooperative games called ensemble games. In machine learning ensembles, pre-trained models cooperate to make classification decisions. To quantify the importance of models in these ensemble games, we define Troupe, an efficient algorithm that allocates payoffs based on approximate Shapley values of the classifiers. We argue that the Shapley value of models in these games is an effective decision metric for choosing a high-performing subset of models from the ensemble. Our analytical findings prove that our Shapley value estimation scheme is precise and scalable; its performance increases with the size of the dataset and ensemble. Empirical results on real-world graph classification tasks demonstrate that our algorithm produces high-quality estimates of the Shapley value. We find that Shapley values can be utilized for ensemble pruning and that adversarial models receive a low valuation. Complex classifiers are frequently found to be responsible for both correct and incorrect classification decisions.

Highlights

  • The advent of black box machine learning models raised fundamental questions about how input features and individual training data points contribute to the decisions of expert systems [17, 28]

  • We argue that the Shapley value [41], a solution concept from cooperative game theory, is an effective model importance metric

  • We propose Troupe, an algorithm which approximates the average of Shapley values in ensemble games and dual games using data

Introduction

The advent of black box machine learning models raised fundamental questions about how input features and individual training data points contribute to the decisions of expert systems [17, 28]. There has been growing interest in how the heterogeneity of models in an ensemble results in heterogeneous contributions of those models to the ensemble's classification decisions [16, 47]. For example, one would expect that computer vision, credit scoring, and fraud detection systems trained on proprietary datasets of varying quality output labels for data points with varying accuracy. Another source of varying model performance is model complexity, e.g., the number of weights in a neural network or the depth of a classification tree.
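The Shapley value quantifies such heterogeneous contributions by averaging a model's marginal contribution to the ensemble's payoff over orderings of the models. As a minimal illustrative sketch (not the paper's Troupe algorithm), the following code estimates Shapley values by Monte Carlo permutation sampling, using an assumed characteristic function: a coalition of classifiers earns payoff 1 on a data point if its majority vote is correct. The model names, predictions, and tie-breaking rule are assumptions for the example.

```python
import random

def coalition_value(coalition, preds, label):
    # Payoff of a coalition of models on one data point: 1 if the
    # coalition's majority vote matches the true label, else 0.
    # (Assumed characteristic function; ties favour label 1.)
    if not coalition:
        return 0.0
    votes = sum(preds[m] for m in coalition)
    majority = 1 if 2 * votes >= len(coalition) else 0
    return 1.0 if majority == label else 0.0

def monte_carlo_shapley(models, preds, label, samples=2000, seed=0):
    # Estimate each model's Shapley value by averaging its marginal
    # contribution over randomly sampled permutations of the models.
    rng = random.Random(seed)
    phi = {m: 0.0 for m in models}
    for _ in range(samples):
        order = list(models)
        rng.shuffle(order)
        coalition = []
        prev = coalition_value(coalition, preds, label)
        for m in order:
            coalition.append(m)
            cur = coalition_value(coalition, preds, label)
            phi[m] += cur - prev
            prev = cur
    return {m: total / samples for m, total in phi.items()}

# Two models voting correctly and one voting incorrectly on a point
# with true label 1: the correct voters share the payoff, while the
# incorrect voter's marginal contribution is zero.
values = monte_carlo_shapley(["a", "b", "c"],
                             {"a": 1, "b": 1, "c": 0},
                             label=1)
```

Averaging such per-point estimates across a dataset yields a valuation that can be used, as the paper argues, to prune weak or adversarial members from the ensemble.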
