Abstract

We propose a method for estimation in high-dimensional linear models with nominal categorical data. Our estimator, called SCOPE, fuses levels together by making their corresponding coefficients exactly equal. This is achieved using the minimax concave penalty on differences between the order statistics of the coefficients for a categorical variable, thereby clustering the coefficients. We provide an algorithm for exact and efficient computation of the global minimum of the resulting nonconvex objective in the case with a single variable with potentially many levels, and use this within a block coordinate descent procedure in the multivariate case. We show that an oracle least squares solution that exploits the unknown level fusions is a limit point of the coordinate descent with high probability, provided the true levels have a certain minimum separation; these conditions are known to be minimal in the univariate case. We demonstrate the favourable performance of SCOPE across a range of real and simulated datasets. An R package CatReg implementing SCOPE for linear models and also a version for logistic regression is available on CRAN.

Highlights

  • Categorical data arise in a number of application areas.

  • A more modern and algorithmic method based on these ideas is Delete or merge regressors [Maj-Kanska et al., 2015], which involves agglomerative clustering based on t-statistics for differences between levels.

  • We see that fixing γ to be a small value provides leading performance. In these low-noise settings, methods that first estimate the clustering of the levels and then estimate the coefficients without introducing further shrinkage, such as Delete or merge regressors (DMR) or Bayesian effect fusion, perform well.

Introduction

Categorical data arise in a number of application areas. For example, electronic health data typically contain records of diagnoses received by patients coded within controlled vocabularies, as well as prescriptions, both of which give rise to categorical variables with large numbers of levels [Jensen et al., 2012]. One class of methods for handling such variables greedily merges levels, for instance via agglomerative clustering based on t-statistics for differences between levels. A potential drawback of these greedy approaches is that in high-dimensional settings, where the search space is very large, they may fail to find good groupings of the levels. An alternative in high-dimensional settings is to use penalty-based methods such as the Lasso [Tibshirani, 1996]. This can be applied to continuous or binary data and involves optimising an objective for which global minimisation is computationally tractable, thereby avoiding some of the pitfalls of greedy optimisation. Inspired by the success of the Lasso and related methods for high-dimensional regression, a variety of approaches have proposed estimating the coefficients θ0 = (θ0jk), j = 1, ..., p, k = 1, ..., Kj, and the intercept μ0 by optimising over (μ, θ) the sum of a least squares criterion and a penalty term that encourages coefficients within a categorical variable to fuse.
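The structure of such an objective can be sketched in a few lines. The following is a minimal illustration, not the CatReg implementation: it evaluates a least squares loss plus the minimax concave penalty (MCP) applied to differences between the order statistics of the coefficients of a single categorical variable, as described in the abstract. The function names, the scaling of the loss, and the choice of design matrix are illustrative assumptions.

```python
import numpy as np

def mcp(u, lam, gamma):
    """Minimax concave penalty rho(u; lam, gamma), applied elementwise.

    rho(u) = lam*|u| - u^2/(2*gamma)  for |u| <= gamma*lam,
             gamma*lam^2/2            otherwise.
    """
    a = np.abs(u)
    return np.where(a <= gamma * lam,
                    lam * a - a**2 / (2 * gamma),
                    gamma * lam**2 / 2)

def scope_objective(y, X, theta, mu, lam, gamma):
    """Sketch of a SCOPE-style objective for one categorical variable:
    least squares loss plus MCP on differences between the order
    statistics (i.e. sorted values) of the level coefficients."""
    resid = y - mu - X @ theta
    loss = 0.5 * np.mean(resid**2)
    # Sorting the coefficients and penalising consecutive differences
    # encourages nearby coefficients to become exactly equal (fusion).
    diffs = np.diff(np.sort(theta))
    return loss + np.sum(mcp(diffs, lam, gamma))
```

Note that when all level coefficients are equal, every consecutive difference is zero and the penalty vanishes, which is what allows fused solutions to be exact minimisers rather than merely shrunk towards each other.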
