Abstract

Selecting a subset from many potential explanatory variables in linear regression has long been a subject of research interest, and the problem has only grown in importance in the era of big data, when far more candidate variables become available. Of late, l_1-norm penalty techniques such as the Lasso of Tibshirani (1996) have become very popular. However, the variable selection problem in its natural setting is a zero-norm penalty problem, i.e., a penalty on the number of variables rather than on the l_1-norm of the regression coefficients. The popularity of the l_1-norm penalty and its variants owes more to computational considerations, because selection under the zero-norm penalty is a highly demanding combinatorial optimization problem when the number of potential variables becomes large. We devise a sequential Monte Carlo (SMC) method as a practical and reliable tool for zero-norm variable selection problems; selecting, say, the best 20 out of 1,000 potential variables can be completed on a typical multi-core desktop computer in a couple of minutes. The methodological essence is to recognize that the selection problem is equivalent to sampling from a discrete probability function defined over all possible combinations of k regressors chosen from p >= k potential variables, where the peak of this function corresponds to the optimal combination. The solution technique sequentially generates samples, and after a while the final sample represents the target probability function. With the final SMC sample in place, we deploy extreme value theory to assess how likely, and to what extent, the maximum R^2 has been achieved. We also demonstrate through a simulation study the method's reliability and its superiority vis-à-vis the adaptive Lasso.
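
The abstract does not spell out the sampler's mechanics, so the following is only a minimal Python sketch of a generic resample-move SMC scheme of the kind described: particles are size-k subsets of the p candidates, weights come from an annealed target proportional to exp(lambda * R^2), and the move step is a Metropolis swap of one included variable for an excluded one. The particle count, annealing schedule, and move kernel are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical sketch of a resample-move SMC sampler for zero-norm (best-subset)
# selection. The target is a distribution over size-k subsets of p regressors,
# peaked at the subset maximizing R^2. All tuning choices here are assumptions.
import numpy as np


def r_squared(X, y, subset):
    """R^2 of the least-squares fit of y on an intercept plus the chosen columns."""
    Xs = np.column_stack([np.ones(len(y)), X[:, subset]])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ beta
    tss = (y - y.mean()) @ (y - y.mean())
    return 1.0 - (resid @ resid) / tss


def smc_subset_selection(X, y, k, n_particles=500, n_steps=30, lam_max=50.0, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    n, p = X.shape
    # Initialize particles as random size-k subsets.
    particles = [rng.choice(p, size=k, replace=False) for _ in range(n_particles)]
    for step in range(1, n_steps + 1):
        lam = lam_max * step / n_steps  # annealing: progressively sharpen the target
        fitness = np.array([r_squared(X, y, s) for s in particles])
        # Tempered importance weights, stabilized by subtracting the maximum.
        w = np.exp(lam * (fitness - fitness.max()))
        w /= w.sum()
        # Resample particles in proportion to their weights.
        idx = rng.choice(n_particles, size=n_particles, p=w)
        particles = [particles[i].copy() for i in idx]
        # Move step: Metropolis swap of one included variable for an excluded one
        # (the proposal is symmetric, so the acceptance ratio only needs R^2).
        for s in particles:
            out_pos = rng.integers(k)
            new_var = rng.choice(np.setdiff1d(np.arange(p), s))
            proposal = s.copy()
            proposal[out_pos] = new_var
            if np.log(rng.random()) < lam * (r_squared(X, y, proposal) - r_squared(X, y, s)):
                s[out_pos] = new_var
    # Report the best subset found in the final particle population.
    best = max(particles, key=lambda s: r_squared(X, y, s))
    return np.sort(best), r_squared(X, y, best)
```

The annealing sequence plays the role of the sequential distributions in SMC: early iterations explore broadly, while later ones concentrate the particles near the mode of the subset distribution, which is the optimal combination the abstract refers to.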
