Abstract

Regression analysis is a standard supervised machine learning method used to model an outcome variable in terms of a set of predictor variables. In most real-world applications the true value of the outcome variable we want to predict is unknown outside the training data, i.e., the ground truth is unknown. Phenomena such as overfitting and concept drift make it difficult to directly observe when the estimate from a model may be wrong. In this paper we present an efficient framework for estimating the generalization error of regression functions, applicable to any family of regression functions when the ground truth is unknown. We present a theoretical derivation of the framework and empirically evaluate its strengths and limitations. We find that it performs robustly and is useful for detecting concept drift in datasets from several real-world domains.

Highlights

  • Regression models are one of the most used and studied machine learning primitives

  • We propose here a general method for obtaining a threshold δ using only the training data, which according to our empirical evaluation performs well in practice for the datasets used in this paper

  • All values exceeding σemp in the test dataset are considered concept drift. While this cross-validation procedure does not fully account for possible autocorrelation in the training data, we found that on our datasets it gives a reasonable estimate of the generalization error in the absence of concept drift


Summary

Introduction

Regression models are one of the most used and studied machine learning primitives. They are used to model a dependent variable (denoted by y ∈ R) given an m-dimensional vector of covariates (here we assume real-valued attributes x ∈ R^m). For a Bayesian regression model the reliability of the estimate can be expressed in terms of the posterior distribution or, more simply, as a confidence interval around the estimate. Another alternative to assess the error of a regression estimate on unseen data is to use (cross-)validation. All of these approaches give some measure of the error on testing data, even when the dependent variable is unknown.
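To make the (cross-)validation alternative concrete, the following is a minimal sketch of k-fold cross-validation for a regression estimate. It is not the framework proposed in the paper. It assumes a single covariate, an ordinary least squares fit, and synthetic data; the names `fit_line` and `kfold_rmse` are hypothetical and chosen for illustration only.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x for a single covariate."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def kfold_rmse(xs, ys, k=5, seed=0):
    """Estimate the root mean squared error on unseen data with k-fold CV."""
    idx = list(range(len(xs)))
    # Random shuffling assumes independent samples; it ignores autocorrelation.
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sq_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit on the remaining folds, then score on the held-out fold.
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_errors.extend((a + b * xs[i] - ys[i]) ** 2 for i in held_out)
    return math.sqrt(sum(sq_errors) / len(sq_errors))


# Synthetic training data (an assumption for illustration): y = 1 + 2x + noise.
rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(200)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 0.1) for x in xs]
cv_rmse = kfold_rmse(xs, ys)
print(f"cross-validated RMSE: {cv_rmse:.3f}")
```

Note that this estimate is computed from training data alone, where the dependent variable is known. It describes the expected error only as long as new data follow the same distribution as the training data.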
