Abstract

In Response: We thank Dr. Arunajadai for his comments about the statistical simulations in our editorial (text NLP, algorithm WMB) demonstrating the perils of stepwise logistic regression.1 His letter allows us to clarify an ambiguity in the nomenclature of the stepwise automatic variable selection algorithm. Correctly specified, the algorithm should be described as stepwise forward selection, stepwise backward elimination, or stepwise with forward selection and/or backward elimination; however, the word stepwise alone is commonly used to refer either to any of the three variants or to just the third. Arunajadai2 has correctly stated that our simulations used the stepwise backward elimination variant. They used randomly created covariates to demonstrate how commonly stepwise modeling (backward elimination variant) creates spurious associations.

Dr. Arunajadai has also provided R software code to perform the other two variants; he reports no spurious associations, with no covariate significant at P < 0.05, using either the forward selection or the forward selection/backward elimination variant. In his code, Arunajadai estimates a mean intercept model object, i.e., “fit <- glm(y ~ 1, data = w, family = binomial)”, for submission to the stepwise function. Submitting a mean intercept model to the stepwise process cannot identify any association, true or spurious. When a full (all covariates) model, i.e., “fit <- glm(y ~ ., data = w, family = binomial)”, is used, all three variants give qualitatively the same result: numerous spurious associations (appendix available at www.anesthesia-analgesia.org). The inclusion of noise variables during stepwise modeling, regardless of the variant, has been demonstrated elsewhere.3–5
Arunajadai also raised the very interesting question of which information criterion should be used at each step for adding or removing a covariate; he advocates the Bayesian Information Criterion (BIC) in contrast to the Akaike Information Criterion (AIC) used in our simulation. Both the AIC and the BIC are indexes in which twice the negative maximized log likelihood of the model fit is penalized by adding either twice the number of model parameters (AIC) or the number of model parameters multiplied by the log of the sample size (BIC). Of the candidate models possible, the model with the lower AIC or lower BIC is favored. As Arunajadai noted, the BIC is more heavily penalized and will produce more parsimonious models (fewer significant covariates). However, there is a tension in choosing between the AIC and the BIC: the AIC yields optimal regression estimation, whereas the BIC provides consistent model identification. It is not possible to create models with the properties favored by both the AIC and the BIC.6 Using the BIC index in our simulation still produces spurious associations.

Automatic variable selection via a stepwise process is a hazardous undertaking. As J. B. Copas3 humorously noted, “If you torture the data for long enough, in the end they will confess …. What more brutal torture can there be than subset selection? The data will always confess, and the confession will usually be wrong.”

Nathan L. Pace, MD, MStat
Department of Anesthesiology
University of Utah
Salt Lake City, Utah

William M. Briggs, PhD
Department of Emergency Medicine
New York Methodist Hospital
Brooklyn, New York
[email protected]
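For readers who prefer symbols, the verbal definitions of the two criteria in the letter can be written out as follows. The notation is an assumption introduced here, not the authors': \hat{L} is the maximized likelihood of the fitted model, k the number of model parameters, and n the sample size.

```latex
\begin{align*}
\mathrm{AIC} &= -2\ln\hat{L} + 2k, \\
\mathrm{BIC} &= -2\ln\hat{L} + k\ln n, \\
\mathrm{BIC} - \mathrm{AIC} &= k\,(\ln n - 2) > 0
  \quad\text{whenever } n > e^{2} \approx 7.39 .
\end{align*}
```

The last line shows why the BIC is the more heavily penalized index for any sample of eight or more observations: each additional parameter costs \ln n rather than 2, so the BIC tends to retain fewer covariates.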

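The contrast drawn in the letter between starting from a mean intercept model and starting from a full model can be sketched in R roughly as below. This is a minimal illustration and not the code from the authors' appendix or from Dr. Arunajadai's letter. The object names w, y, and the two glm() calls follow the code quoted in the letter; the sample size, number of covariates, seed, the helper kept(), and the use of stats::step() as "the stepwise function" are all assumptions made for the example.

```r
# Illustrative only: sizes and seed are assumptions, not the authors' settings.
set.seed(2024)
n <- 200   # observations (assumed)
p <- 25    # pure-noise covariates (assumed)

# Covariates and outcome are generated independently, so any covariate that
# survives selection is a spurious association by construction.
w <- data.frame(matrix(rnorm(n * p), nrow = n, ncol = p))
w$y <- rbinom(n, size = 1, prob = 0.5)
covariates <- setdiff(names(w), "y")

null_fit <- glm(y ~ 1, data = w, family = binomial)  # mean intercept model
full_fit <- glm(y ~ ., data = w, family = binomial)  # all covariates

# Hypothetical helper: names of the covariates retained in a fitted model.
kept <- function(model) attr(terms(model), "term.labels")

# 1. Intercept-only start with no scope: step() treats the initial model as
#    the upper model, so there are no candidate covariates to add.
no_scope <- step(null_fit, direction = "forward", trace = 0)

# 2. Same start, but with the candidate covariates supplied through `scope`.
scope <- list(lower = ~ 1, upper = reformulate(covariates))
forward <- step(null_fit, scope = scope, direction = "forward", trace = 0)
both    <- step(null_fit, scope = scope, direction = "both", trace = 0)

# 3. Full-model start with backward elimination (default AIC penalty, k = 2).
backward <- step(full_fit, direction = "backward", trace = 0)

# 4. Backward elimination with the BIC-style penalty, k = log(n).
backward_bic <- step(full_fit, direction = "backward", k = log(n), trace = 0)

# Number of noise covariates retained by each procedure.
sapply(list(no_scope = no_scope, forward = forward, both = both,
            backward = backward, backward_bic = backward_bic),
       function(m) length(kept(m)))
```

The first call necessarily retains nothing, because no candidate covariates were offered to it. The remaining counts depend on the seed and the assumed sizes, so they are best read across many repeated simulations rather than from a single run.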