Response-Aided Score-Matching Representative Approaches for Big Data Analysis and Model Selection under Generalized Linear Models

Duo Zheng, Keren Li, Jie Yang

doi:10.3390/a17100456

Abstract

In this paper, we propose an efficient method, the response-aided score-matching representative (RASMR) approach, for model selection and data analysis on massive data under generalized linear models (GLMs) when the data partition is predetermined by data localization. Like the original score-matching representative (SMR) approach, RASMR constructs an artificial data point, called a representative, for each data block and then fits a GLM on the resulting representative dataset. This offers not only an efficient approach to massive data analysis but also a solution to privacy concerns, since sensitive data need not be transferred. By further splitting the data blocks according to the values of the response variables, RASMR can obtain more accurate parameter estimates than SMR. Furthermore, through theoretical justification and simulation studies, we show that RASMR can be used efficiently for model selection and variable selection on a massive dataset by approximating the Akaike information criterion (AIC) and the aggregated prediction errors for cross-validation, both of which are commonly used to choose the most appropriate statistical model and draw reliable conclusions. We also apply the proposed RASMR approach to the airline on-time performance data, which consists of 371 data files labeled by month, and show that RASMR can be successfully used to select the most appropriate model in a real massive data analysis.

Full Text