Abstract
In this paper, we propose an efficient method, the response-aided score-matching representative (RASMR) approach, for model selection and data analysis on massive data with generalized linear models (GLMs) when the data partition is predetermined by data localization. Like the original score-matching representative (SMR) approach, RASMR constructs an artificial data point, called the representative, for each data block and then fits a GLM on the representative dataset. This provides not only an efficient approach to massive data analysis but also an ideal response to privacy concerns, since sensitive data need not be transferred. By further splitting the data blocks according to the values of the response variables, RASMR can obtain more accurate parameter estimates than SMR. Furthermore, through theoretical justification and simulation studies, we show that RASMR can be used more efficiently for model selection and variable selection on a massive dataset by approximating the Akaike information criterion (AIC) and the aggregated prediction errors for cross-validation, both of which are commonly used for choosing the most appropriate statistical model and drawing reliable conclusions. We also apply RASMR to the airline on-time performance data, which consists of 371 data files labeled by month, and show that it can be used successfully to select the most appropriate model in a real massive data analysis.