Abstract

This article extends the problem of variable selection to a nonparametric regression model with categorical covariates. Two selection criteria are considered: the cross-validation (CV) criterion and the accumulated prediction error (APE) criterion. We find that, asymptotically, the CV criterion performs well only when the true model is infinite-dimensional, while the APE criterion is appropriate when the true model is finite-dimensional. This closely parallels the linear regression case. A simulation study reveals some interesting small-sample properties of these criteria.

To be more specific, suppose that we have observations (X_1, Y_1), …, (X_n, Y_n) that are iid random vectors, with X = (X(1), X(2), …), where the X(i)'s are categorical. We allow Y to be of any type. When a new observation X arrives, we want to predict the corresponding Y. Such a framework is more appropriate than regression with fixed covariates in situations where the covariates are observed rather than controlled. For instance, Y could be the time from HIV infection to the development of clinical AIDS, and the covariates (mostly categorical or reducible to categorical) could be observations from blood tests, a physical examination, or further personal information, such as sexual practices, obtained from an interview. As another example, Y could be the premium of an insurance policy, with the covariates being the customer's general demographic information.

Our goal is to select a subset of covariates that best predicts Y. We define the true model dimension as d_0 if the regression function E(Y|X(1), X(2), …) is a d_0-variate function. The main conclusions of the article are:

(1) The popular CV criterion performs well only when d_0 = ∞.
(2) There exist other criteria that are more appropriate than CV when d_0 < ∞.
(3) There is no difference between conditional and unconditional prediction errors, as far as asymptotics are concerned.
(4) The selection range has to depend on the sample size. In fact, we argue that, for a given sample size n, we should select only among models whose number of covariates is of order o(log n).
(5) A simulation study indicates that the CV criterion has nice small-sample properties.
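
To make the selection problem concrete, the following is a minimal sketch of leave-one-out CV over subsets of categorical covariates. It is an illustration under stated assumptions, not the article's procedure: the predictor is assumed to be the mean of Y within each combination of the selected covariates' categories, squared error is assumed as the loss, and the names `loo_cv_score`, `select_by_cv`, and `max_size` are hypothetical. The cap `max_size` stands in for a sample-size-dependent selection range. The APE criterion is not sketched here.

```python
from itertools import combinations

import numpy as np


def loo_cv_score(X, y, subset):
    """Leave-one-out CV mean squared error of the cell-mean predictor
    built on the covariate columns listed in `subset`."""
    n = len(y)
    total = y.sum()
    # Key each observation by its category combination on the chosen columns.
    keys = [tuple(row[list(subset)]) for row in X]
    sums, counts = {}, {}
    for k, yi in zip(keys, y):
        sums[k] = sums.get(k, 0.0) + yi
        counts[k] = counts.get(k, 0) + 1
    err = 0.0
    for k, yi in zip(keys, y):
        if counts[k] > 1:
            # Cell mean computed without observation i.
            pred = (sums[k] - yi) / (counts[k] - 1)
        else:
            # Assumption: if the cell is empty once i is removed,
            # fall back to the leave-one-out global mean.
            pred = (total - yi) / (n - 1)
        err += (yi - pred) ** 2
    return err / n


def select_by_cv(X, y, max_size):
    """Return (subset, score) minimising the CV score over all subsets
    with at most `max_size` covariates (hypothetical cap)."""
    best = None
    for size in range(max_size + 1):
        for subset in combinations(range(X.shape[1]), size):
            score = loo_cv_score(X, y, subset)
            if best is None or score < best[1]:
                best = (subset, score)
    return best


# Synthetic illustration: Y depends only on the first two of four binary covariates.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=200)
subset, score = select_by_cv(X, y, max_size=3)
print(subset, score)
```

In this synthetic setup the true model is finite-dimensional (two covariates), so the selected subset should contain columns 0 and 1; whether CV also admits a spurious extra covariate is the kind of behaviour the conclusions above concern.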

