Abstract
Covariate selection is a fundamental step when building sparse prediction models, in order to avoid overfitting and to gain a better interpretation of the classifier without losing its predictive accuracy. In practice, the LASSO regression of Tibshirani, which penalizes the likelihood of the model by the L1 norm of the regression coefficients, has become the gold standard for reaching these objectives. Recently, Lee and Oh developed a novel random-effect covariate selection method called the modified unbounded penalty (MUB) regression, whose penalization function can equal minus infinity at 0 in order to produce very sparse models. We sought to compare the predictive accuracy and the number of covariates selected by these two methods in several high-dimensional datasets, consisting of gene expression measurements used to predict response to chemotherapy in breast cancer patients. These comparisons were performed by building the Receiver Operating Characteristic (ROC) curves of the classifiers obtained with the selected genes and by comparing their area under the ROC curve (AUC), corrected for optimism using several variants of bootstrap internal validation and cross-validation. We found consistently across all datasets that the MUB penalization selected a remarkably smaller number of covariates than the LASSO while offering similar, and encouraging, predictive accuracy. The models selected by the MUB were in fact nested within those obtained with the LASSO. Similar findings were observed when comparing these results to those obtained by other authors in the original publications of these datasets, or when using the area under the Precision-Recall curve (AUCPR) as an alternative measure of predictive performance. In conclusion, the MUB penalization therefore seems to be one of the best options when sparsity is required in high dimensions. However, further investigation in other datasets is required to validate these findings.
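The comparison described above can be illustrated with a minimal sketch, not the authors' code: L1-penalized (LASSO) logistic regression applied to synthetic high-dimensional data, counting the covariates it selects and computing the apparent AUC. All data and parameter values here are assumptions chosen for illustration; scikit-learn is used as an off-the-shelf implementation.

```python
# Illustrative sketch: LASSO-penalized logistic regression as a covariate
# selector, evaluated with the area under the ROC curve (AUC).
# Synthetic data only; the penalty strength C is an arbitrary choice.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, p = 120, 200                       # more covariates than samples (high-dimensional)
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 1.5                        # only 5 truly informative covariates
y = (X @ beta + rng.normal(size=n) > 0).astype(int)

# The L1 penalty drives most coefficients exactly to zero (covariate selection).
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
lasso.fit(X, y)

selected = np.flatnonzero(lasso.coef_[0])
auc = roc_auc_score(y, lasso.decision_function(X))
print(f"{len(selected)} covariates selected out of {p}; apparent AUC = {auc:.2f}")
```

Note that the apparent AUC computed on the training data is optimistic; the paper corrects it with bootstrap internal validation and cross-validation.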
Highlights
When building prediction models, covariate selection is a fundamental step to maximize the interpretability of the classifier and to avoid overfitting [1,2,3,4,5] while maintaining predictive accuracy
The regression coefficients of the covariates selected by at least one of the methods are depicted on the y-axis. The covariates selected by the modified unbounded penalty (MUB) appear to be a subset of those obtained with the LASSO
For the covariates retained by both methods, the estimated MUB regression coefficients were greater in absolute value than the LASSO ones
Summary
Covariate selection is a fundamental step to maximize the interpretability of the classifier and to avoid overfitting [1,2,3,4,5] while maintaining predictive accuracy. In medicine and biology, for example, predictive biomarkers can be very costly to measure, and the larger the number of covariates needed in a predictive model, the higher its effective cost. In this respect, the LASSO regression has become the gold standard for covariate selection [5, 6]. More recently, Lee and Oh [12] developed a novel random-effect covariate selection method called the MUB regression, whose penalization function can equal minus infinity at 0. This method offered promising results in simulations and in toy datasets [13, 14] and deserves further practical investigation.
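Both methods can be viewed as penalized maximum likelihood estimators. A hedged sketch of the shared objective, with notation assumed here rather than taken from the paper:

```latex
% Generic penalized log-likelihood maximized by both methods:
\hat{\beta} = \arg\max_{\beta} \; \ell(\beta) - \sum_{j=1}^{p} p_{\lambda}(|\beta_j|)
% LASSO instance of the penalty:
p_{\lambda}(|\beta_j|) = \lambda \, |\beta_j|
% The MUB of Lee and Oh replaces p_{\lambda} with an unbounded penalty
% (its exact form is given in [12]), which shrinks many coefficients
% exactly to zero and yields the sparser fits reported here.
```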