Performance of logistic regression modeling: beyond the number of events per variable, the role of data structure

Delphine S Courvoisier, Christophe Combescure, Thomas Agoritsas, Angèle Gayet-Ageron, Thomas V Perneger

doi:10.1016/j.jclinepi.2011.06.013


Open Access


Journal: Journal of Clinical Epidemiology
Publication Date: Oct 25, 2011
Citations: 3


Abstract


In their commentary, Steyerberg et al. [1] point out both a possible solution to decrease bias and an additional source of bias. These comments expand rather than contradict our findings, and we thank the authors for thus providing a more comprehensive view of the various predictors of correct (or incorrect) estimation of parameters [2].

Although it is true that there are alternatives to maximum likelihood estimation (MLE) that guarantee convergence, these methods are not implemented by default in most currently available statistical analysis programs. Thus, most researchers will continue to encounter convergence problems. Because convergence problems occur when data are separated or nearly separated (i.e., when the distributions of the predictor under Y=1 and Y=0 do not overlap), statistical programs should provide information on whether or not there is separation. They also could be programmed to apply alternatives to MLE whenever convergence problems occur, particularly when data are separated or nearly separated [3, 4, 5].

We also agree that selection bias may be larger than the estimation bias because of overfitting of the model to a specific sample, especially when model selection is purely data-driven. In that situation two phenomena contribute to bias. First, when power is low, only large estimates are statistically significant and are retained in the model (bias because of covariate selection method). Second, if important covariates are omitted because they lack significance, the effects of the remaining covariates may be incorrectly adjusted and hence biased (bias because of under-adjustment). Under-adjustment bias will be potentially stronger if the predictors are more strongly correlated.

To explore this issue, we use the simulation results obtained in the main article for seven continuous predictors with a true odds ratio of 1.5 per standard deviation and correlations among predictors of either 0.2 or 0.7 (Fig. 1). Similarly to Steyerberg et al. [1], we either considered the predictors as prespecified or selected only those predictors with statistically significant effects. As expected, the relative bias among the selected coefficients increases as events per variable (EPV) decreases. Moreover, the correlations among predictors influence relative bias in selected models but not in prespecified models. Indeed, for identical EPV, relative bias approximately doubles between selected models when covariates are weakly correlated (r=0.2) and selected models when covariates are strongly correlated (r=0.7). This confirms the impact of omitting nonsignificant confounding factors when estimating the effects of the significant variables.

Relative bias in predictors is an important problem for diagnostic and prognostic scores. We agree with Steyerberg et al. that the best statistical and clinical solution to obtain unbiased coefficients is to preselect a set of predictors based on subject knowledge. However, data-driven model selection also will often be necessary when little previous knowledge is available (e.g., discovery of a new pathogen or in genomics). We also agree that the guidelines on the number of EPV appropriate for data-driven model selection should be higher than the guidelines (made for parameter estimation) provided in the main article. Nevertheless, the principle of taking into account data structure to determine the number of EPV remains relevant.

References

[1] Steyerberg E.W., Schemper M., Harrell F. Logistic regression modeling and the number of events per variable: selection bias dominates. J Clin Epidemiol. 2011; 64 (in this issue): 1464-1465.
[2] Courvoisier D.S., Combescure C., Agoritsas T., Gayet-Ageron A., Perneger T.V. Performance of logistic regression modeling: beyond the number of events per variable, the role of data structure. J Clin Epidemiol. 2011; 64: 993-1000.
[3] Heinze G. A comparative investigation of methods for logistic regression with separated or nearly separated data. Stat Med. 2006; 25: 4216-4226.
[4] Heinze G., Ploner M. Fixing the nonconvergence bug in logistic regression with SPLUS and SAS. Comput Methods Programs Biomed. 2003; 71: 181-187.
[5] Heinze G., Schemper M. A solution to the problem of separation in logistic regression. Stat Med. 2002; 21: 2409-2419.
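As an illustration of the separation condition mentioned in the reply (the distributions of a predictor under Y=1 and Y=0 do not overlap), the following is a minimal sketch of the kind of check a statistical program could report before fitting. It is not taken from the article or from any statistical package. The function name `separation_status` and the labels it returns are assumptions made here for illustration, and the check considers one predictor at a time, so it would not detect separation that arises only from a combination of several predictors.

```python
from typing import Sequence


def separation_status(x: Sequence[float], y: Sequence[int]) -> str:
    """Classify one predictor against a binary outcome (hypothetical helper).

    Returns "complete" if the ranges of x under y=0 and y=1 do not overlap,
    "quasi-complete" if they share only a boundary value, and "none" otherwise.
    """
    x0 = [xi for xi, yi in zip(x, y) if yi == 0]
    x1 = [xi for xi, yi in zip(x, y) if yi == 1]
    if not x0 or not x1:
        raise ValueError("both outcome classes must be present")

    # A constant predictor carries no information; report no separation.
    if min(x) == max(x):
        return "none"

    lo0, hi0 = min(x0), max(x0)
    lo1, hi1 = min(x1), max(x1)

    # One group lies strictly below the other: ranges do not overlap.
    if hi0 < lo1 or hi1 < lo0:
        return "complete"
    # The groups meet only at a single boundary value.
    if hi0 == lo1 or hi1 == lo0:
        return "quasi-complete"
    return "none"


if __name__ == "__main__":
    outcome = [0, 0, 1, 1]
    print(separation_status([1.0, 2.0, 3.0, 4.0], outcome))  # complete
    print(separation_status([1.0, 2.0, 2.0, 3.0], outcome))  # quasi-complete
    print(separation_status([1.0, 3.0, 2.0, 4.0], outcome))  # none
```

A result other than "none" would be a signal that standard MLE may fail to converge for that predictor and that one of the alternatives discussed in references [3], [4], and [5] could be considered.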
