Abstract
Built upon an iterative process of resampling without replacement and out-of-sample prediction, the delete-d cross validation statistic CV(d) provides a robust estimate of forecast error variance. To compute CV(d), a dataset consisting of n observations of predictor and response values is systematically and repeatedly partitioned (split) into subsets of size n – d (used for model training) and d (used for model testing). Two aspects of CV(d) are explored in this paper. First, estimates for the unknown expected value E[CV(d)] are simulated in an OLS linear regression setting. Results suggest general formulas for E[CV(d)] dependent on σ² (“true” model error variance), n – d (training set size), and p (number of predictors in the model). The conjectured E[CV(d)] formulas are connected back to theory and generalized. The formulas break down at the two largest allowable d values (d = n – p – 1 and d = n – p, the 1 and 0 degrees of freedom cases), and numerical instabilities are observed at these points. An explanation for this distinct behavior remains an open question. For the second analysis, simulation is used to demonstrate how the previously established asymptotic conditions {d/n → 1 and n – d → ∞ as n → ∞} required for optimal linear model selection using CV(d) for model ranking are manifested in the smallest sample setting, using either independent or correlated candidate predictors.
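The computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's exact procedure: the function name `cv_d`, the number of random splits, and the toy data are all assumptions for the example; CV(d) here is approximated by Monte Carlo over random delete-d splits rather than by enumerating all partitions.

```python
import numpy as np

rng = np.random.default_rng(0)

def cv_d(X, y, d, n_splits=200):
    """Monte Carlo sketch of the delete-d CV statistic CV(d).

    Each iteration splits the n observations into a training set of
    size n - d and a test set of size d, fits OLS (with intercept) on
    the training rows, and records the mean squared prediction error
    on the d held-out rows. CV(d) is the average of those errors.
    """
    n = len(y)
    errs = []
    for _ in range(n_splits):
        idx = rng.permutation(n)            # resample without replacement
        test, train = idx[:d], idx[d:]
        Xtr = np.column_stack([np.ones(n - d), X[train]])
        beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        Xte = np.column_stack([np.ones(d), X[test]])
        errs.append(np.mean((y[test] - Xte @ beta) ** 2))
    return np.mean(errs)

# toy OLS setting: n = 30 observations, p = 2 predictors, sigma^2 = 1
X = rng.normal(size=(30, 2))
y = X @ np.array([1.0, -0.5]) + rng.normal(size=30)
print(cv_d(X, y, d=5))
```

With σ² = 1 the printed value should land somewhat above 1, reflecting the inflation of prediction error relative to σ² that the conjectured E[CV(d)] formulas quantify in terms of n – d and p.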
Highlights
Cross validation (CV) is a model evaluation technique that utilizes data splitting
Equation (2) was examined because it was the only explicitly stated estimate for E[CV(d)] found in the literature. This expression gives the expected mean squared error of prediction (MSEP) for using a linear regression model to make a prediction for some future observation at a design point
The random subset design used for making “out-of-sample” predictions when computing the CV(d) statistic is more logically associated with the expected MSEP for using a linear regression model to make a prediction for some future observation at a random X value
Summary
Cross validation (CV) is a model evaluation technique that utilizes data splitting. To describe CV, suppose that each data observation consists of a response value (the dependent variable) and corresponding predictor values (the independent variables) that will be used in some specified model form for the response. Stone [5] examined the use of delete-1 cross validation methods for regression coefficient “shrinker” estimation. Numerous authors have discussed and examined the properties of CV(1) in the context of model selection (e.g., [2] [8] [9]). Many researchers have examined CV(d) for one or more d values in actual and simulated case studies involving model selection (e.g., [1] [2] [11]), but not to the extent of exposing any general, finite-sample statistical tendencies of CV(d) as a function of d.
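The model-selection use of CV(d) discussed above can be illustrated with a small sketch: rank a parsimonious true model against an overfit candidate by their CV(d) values, using a d large relative to n (the d/n → 1 regime the abstract references). All names (`cv_d_mse`), the split count, and the data-generating setup are assumptions for this example, not the paper's simulation design.

```python
import numpy as np

rng = np.random.default_rng(1)

def cv_d_mse(cols, X, y, d, n_splits=300):
    """Average held-out MSE over random delete-d splits for the OLS
    model (with intercept) using the predictor columns in `cols`."""
    n = len(y)
    errs = []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        test, train = idx[:d], idx[d:]
        Xtr = np.column_stack([np.ones(n - d)] + [X[train, j] for j in cols])
        beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        Xte = np.column_stack([np.ones(d)] + [X[test, j] for j in cols])
        errs.append(np.mean((y[test] - Xte @ beta) ** 2))
    return np.mean(errs)

# small-sample setting: n = 20, only column 0 truly drives the response
n = 20
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] + rng.normal(size=n)

# large d (d/n = 0.7), so training sets have only n - d = 6 rows
d = 14
true_cv = cv_d_mse([0], X, y, d)        # true one-predictor model
over_cv = cv_d_mse([0, 1, 2], X, y, d)  # overfit three-predictor model
print(true_cv, over_cv)
```

Ranking the candidates by these values should favor the true model here, since the overfit candidate's extra parameters inflate its out-of-sample error sharply when the training set is this small.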