Champion-challenger based predictive model selection

Shyam Nath

doi:10.1109/secon.2007.342897

Abstract

Selecting appropriate data mining predictive models is a challenging task. While it is easy to evaluate a model on the historical data available at a given point in time, using a confusion matrix and misclassification rate, it is much harder to ensure that the selected model remains the most effective one after deployment as newer data comes in. Here we address the issue of how to continually strive for the best model even after a predictive model is deployed for production use. In the champion-challenger based model selection paradigm, the historical data is used to create the best, or champion, predictive model using criteria such as the misclassification rate for a given cost matrix. Apart from the champion model, a number of other models are selected that are not as good as the champion in predictive accuracy on the same data. These models are termed challengers to the current champion model. They may differ from the champion model in the underlying predictive algorithm, the algorithm tuning parameters, or the model attributes used. Predictive modeling starts with the conventional steps: identifying the business problem that warrants predictive modeling, finding the significant attributes for modeling, analyzing data quality, and then building and evaluating the models. However, the emphasis is not on finding just the top, or champion, model but also on finding the other models that are close to it in model performance. The guiding principle here is that selecting the best predictive model based on the current set of historical data is not a permanent stamp of approval. Real-world systems that use predictive modeling are complex and dynamic, and they need to incorporate a means to capture that. When the champion model is deployed in a production system and used for predictions, its results are saved in a table. Likewise, the challenger models are used to score a subset of the data, and their results are saved as well. The predictions of the challenger models do not affect the real-time predictive use of the system. Once the time interval covered by a prediction has elapsed, the actual results are captured for the same instances of data.
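The comparison described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the model names (`tree_depth_3`, `naive_bayes`), the cost values, and the use of in-memory dictionaries in place of the saved prediction tables are all assumptions made for illustration. Each challenger is compared with the champion only on the instances that both models scored and whose actual outcome is already known, using the average misclassification cost under a given cost matrix.

```python
from typing import Dict, Hashable, Iterable, Tuple

Label = Hashable
# Hypothetical cost matrix layout: (actual, predicted) -> cost of that outcome.
CostMatrix = Dict[Tuple[Label, Label], float]
PredictionLog = Dict[int, Label]  # instance id -> predicted label


def average_cost(predictions: PredictionLog, actuals: Dict[int, Label],
                 cost_matrix: CostMatrix, ids: Iterable[int]) -> float:
    """Mean misclassification cost of `predictions` over the given instance ids."""
    ids = list(ids)
    if not ids:
        raise ValueError("no instances to evaluate")
    total = sum(cost_matrix[(actuals[i], predictions[i])] for i in ids)
    return total / len(ids)


def compare_on_shared_instances(champion_log: PredictionLog,
                                challenger_logs: Dict[str, PredictionLog],
                                actuals: Dict[int, Label],
                                cost_matrix: CostMatrix
                                ) -> Dict[str, Tuple[float, float]]:
    """Return {challenger: (champion cost, challenger cost)} on shared instances."""
    results = {}
    for name, log in challenger_logs.items():
        # Challengers score only a subset, so restrict both models to that subset.
        shared = [i for i in log if i in champion_log and i in actuals]
        results[name] = (
            average_cost(champion_log, actuals, cost_matrix, shared),
            average_cost(log, actuals, cost_matrix, shared),
        )
    return results


# Illustrative data only: a missed positive costs five times a false alarm.
cost_matrix = {(0, 0): 0.0, (1, 1): 0.0, (0, 1): 1.0, (1, 0): 5.0}
champion_log = {1: 0, 2: 1, 3: 0, 4: 0, 5: 1, 6: 0}
challenger_logs = {
    "tree_depth_3": {2: 1, 3: 1, 4: 0, 6: 0},
    "naive_bayes": {1: 0, 3: 0, 5: 0},
}
actuals = {1: 0, 2: 1, 3: 1, 4: 0, 5: 1, 6: 0}  # captured after the fact

results = compare_on_shared_instances(champion_log, challenger_logs,
                                      actuals, cost_matrix)
for name, (champion_cost, challenger_cost) in results.items():
    verdict = "beats" if challenger_cost < champion_cost else "does not beat"
    print(f"{name} {verdict} the champion on their shared instances")
```

Restricting the comparison to shared instances keeps it fair when a challenger scores only part of the production data. A real system would likely also require a minimum sample size before acting on such a comparison.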

Full Text