In clusterwise regression analysis, the goal is to predict a response variable based on a set of explanatory variables, each with cluster-specific effects. In many real-life problems, the number of candidate predictors is large, with perhaps only a few of them meaningfully contributing to the prediction. A well-known method to perform variable selection is the LASSO, with calibration done by minimizing the Bayesian Information Criterion (BIC). However, existing LASSO-penalized estimators are problematic for several reasons. First, only certain types of penalties are considered. Second, the computations may sometimes involve approximate schemes. Third, variable selection is usually time-consuming, due to a complex calibration of the penalty term, possibly requiring multiple evaluations of an estimator for each plausible value of the tuning parameter(s). We introduce a two-step approach to fill these gaps. In step 1 (Fit step), we fit LASSO clusterwise linear regressions with some pre-specified level of penalization. In step 2 (Selection step), we perform covariate selection locally, i.e., on the weighted data, with weights corresponding to the posterior probabilities from the previous step. This is done by using a generalization of the Least Angle Regression (LARS) algorithm, which permits covariate selection with a single evaluation of the estimator. In addition, both the Fit and Selection steps rely on an Expectation Maximization (EM) algorithm, with fully closed-form updates, designed for a very general version of the LASSO penalty. The advantages of our proposal, in terms of computation time reduction and accuracy of model estimation and selection, are shown by means of a simulation study and illustrated with a real data application.
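To make the two-step structure concrete, the following is a minimal Python sketch, not the estimator proposed here. It assumes Gaussian cluster-specific linear models and a plain L1 penalty, and it swaps in two simpler techniques: coordinate descent replaces the closed-form EM updates, and a grid search scored by a weighted BIC replaces the generalized LARS algorithm (so it does not share the single-evaluation property). All function names and parameters (`weighted_lasso_cd`, `fit_clusterwise_lasso`, `select_per_cluster`, `lam`, `lams`) are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding: closed-form solution of a one-dimensional LASSO problem."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def weighted_lasso_cd(X, y, w, lam, n_sweeps=200):
    """Minimize 0.5 * sum_i w_i (y_i - x_i'b)^2 + lam * sum(w) * ||b||_1
    by cyclic coordinate descent (a stand-in, not the paper's algorithm)."""
    n, p = X.shape
    b = np.zeros(p)
    r = y.astype(float)                          # residual y - X b, with b = 0
    col_norm = (w[:, None] * X**2).sum(axis=0)   # weighted squared column norms
    thresh = lam * w.sum()
    for _ in range(n_sweeps):
        for j in range(p):
            if col_norm[j] == 0.0:
                continue
            r += X[:, j] * b[j]                  # remove coordinate j from the fit
            rho = np.sum(w * X[:, j] * r)
            b[j] = soft_threshold(rho, thresh) / col_norm[j]
            r -= X[:, j] * b[j]                  # add the updated coordinate back
    return b

def e_step(X, y, betas, sigmas, pis):
    """Posterior cluster membership probabilities, shape (n, K)."""
    resid = y[:, None] - X @ betas.T
    log_dens = (np.log(pis) - 0.5 * np.log(2 * np.pi * sigmas**2)
                - 0.5 * (resid / sigmas) ** 2)
    log_dens -= log_dens.max(axis=1, keepdims=True)   # numerical stability
    post = np.exp(log_dens)
    return post / post.sum(axis=1, keepdims=True)

def fit_clusterwise_lasso(X, y, K, lam, n_iter=50, seed=0):
    """Fit step: EM with a fixed, pre-specified penalty level lam."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    post = rng.dirichlet(np.ones(K), size=n)     # random soft initial assignment
    betas, sigmas, pis = np.zeros((K, p)), np.ones(K), np.full(K, 1.0 / K)
    for _ in range(n_iter):
        for k in range(K):                       # M-step, one cluster at a time
            w = post[:, k]
            betas[k] = weighted_lasso_cd(X, y, w, lam, n_sweeps=50)
            resid = y - X @ betas[k]
            sigmas[k] = np.sqrt(np.sum(w * resid**2) / (w.sum() + 1e-12)) + 1e-8
            pis[k] = max(w.mean(), 1e-12)
        post = e_step(X, y, betas, sigmas, pis)
    return betas, sigmas, pis, post

def select_per_cluster(X, y, post, lams):
    """Selection step stand-in: per-cluster weighted LASSO over a grid of
    penalties, keeping the support with the smallest weighted BIC."""
    supports = []
    for k in range(post.shape[1]):
        w = post[:, k]
        n_k = max(w.sum(), 1e-12)                # effective cluster size
        best_bic, best_support = np.inf, None
        for lam in lams:
            b = weighted_lasso_cd(X, y, w, lam)
            rss = np.sum(w * (y - X @ b) ** 2)
            bic = n_k * np.log(rss / n_k + 1e-12) + np.log(n_k) * np.count_nonzero(b)
            if bic < best_bic:
                best_bic, best_support = bic, np.flatnonzero(b)
        supports.append(best_support)
    return supports

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n, p = 300, 6
    X = rng.normal(size=(n, p))
    z = rng.integers(0, 2, size=n)               # simulated cluster labels
    true_betas = np.array([[2.0, 0, 0, -1.5, 0, 0], [-2.0, 1.0, 0, 0, 0, 0]])
    y = np.einsum("ij,ij->i", X, true_betas[z]) + 0.3 * rng.normal(size=n)
    _, _, _, post = fit_clusterwise_lasso(X, y, K=2, lam=0.05)
    print(select_per_cluster(X, y, post, lams=np.logspace(-3, 0, 10)))
```

The point of the sketch is the division of labour: the Fit step only needs one penalty level to produce posterior weights, and the Selection step then treats each cluster as a separate weighted regression problem. Like any EM scheme, the result depends on the random initialization, so several starts would normally be compared.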