Automatic variable selection in a linear model on massive data

Gabriela Ciuperca

doi:10.1080/03610918.2020.1752377

Abstract

For a linear model on massive data, we propose an aggregated estimator built from adaptive LASSO estimators. The proposed method reduces the data storage volume and introduces an aggregated estimator that automatically selects the significant explanatory variables with probability converging to one. Moreover, the components of the aggregated estimator corresponding to the nonzero true parameters have the same asymptotic normal distribution as the adaptive LASSO estimator computed on the full data. For massive data, however, the estimator on the full data is practically impossible to compute, for lack of memory or storage. A further benefit of our method is therefore that it works around the insufficient-memory problem encountered in statistical software when the number of observations is very large. The empirical performance is investigated in a comparative simulation study, and a real data example illustrates the usefulness of the method.
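As an illustration of the general idea only, the following Python sketch fits an adaptive LASSO on each block of the data and then combines the block estimates. Several choices here are assumptions not stated in the abstract: the split into contiguous blocks, the least-squares initial weights with exponent `gamma = 1`, the fixed tuning value `lam`, and the aggregation rule (a variable is kept when it is selected in a majority of blocks, and its coefficient is the average over blocks). All function names are hypothetical and the data are simulated.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator used by the LASSO coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def coordinate_descent(G, c, penalties, n_iter=200):
    """Minimise 0.5*b'Gb - c'b + sum_j penalties[j]*|b_j| by cyclic updates."""
    p = len(c)
    b = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            r = c[j] - sum(G[j][k] * b[k] for k in range(p) if k != j)
            b[j] = soft_threshold(r, penalties[j]) / G[j][j]
    return b

def adaptive_lasso(X, y, lam, gamma=1.0):
    """Adaptive LASSO on one block; weights come from a least-squares fit."""
    n, p = len(X), len(X[0])
    # Only the p x p Gram matrix and the p-vector X'y are needed below.
    G = [[sum(row[j] * row[k] for row in X) / n for k in range(p)]
         for j in range(p)]
    c = [sum(row[j] * yi for row, yi in zip(X, y)) / n for j in range(p)]
    ols = coordinate_descent(G, c, [0.0] * p)  # zero penalty: least squares
    penalties = [lam / max(abs(b), 1e-12) ** gamma for b in ols]
    return coordinate_descent(G, c, penalties)

def aggregated_estimator(X, y, n_blocks, lam):
    """Assumed aggregation rule: majority vote for selection, then average."""
    n, p = len(X), len(X[0])
    size = n // n_blocks
    estimates = []
    for i in range(n_blocks):
        lo = i * size
        hi = n if i == n_blocks - 1 else (i + 1) * size
        estimates.append(adaptive_lasso(X[lo:hi], y[lo:hi], lam))
    agg = []
    for j in range(p):
        col = [e[j] for e in estimates]
        n_selected = sum(1 for v in col if v != 0.0)
        agg.append(sum(col) / n_blocks if 2 * n_selected > n_blocks else 0.0)
    return agg

# Simulated example: two significant variables out of five.
random.seed(0)
beta_true = [2.0, 0.0, -1.5, 0.0, 0.0]
X = [[random.gauss(0.0, 1.0) for _ in beta_true] for _ in range(2000)]
y = [sum(b * x for b, x in zip(beta_true, row)) + random.gauss(0.0, 1.0)
     for row in X]

agg = aggregated_estimator(X, y, n_blocks=4, lam=0.05)
print(agg)
```

In this sketch each block is processed independently, so only one block needs to be held in memory at a time; whether this matches the storage reduction meant by the author is not something the abstract settles.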

Full Text