Abstract

We propose a fast and efficient strategy, called the representative approach, for big data analysis with generalized linear models, especially for distributed data with localization requirements or limited network bandwidth. Given a partition of a massive dataset, this approach constructs a representative data point for each data block and fits the target model using the representative dataset. In terms of time complexity, it is as fast as the subsampling approaches in the literature. In terms of efficiency, its accuracy in estimating parameters under a homogeneous partition is comparable with that of the divide-and-conquer method. Supported by comprehensive simulation studies and theoretical justifications, we conclude that mean representatives (MR) work well for linear models, and for generalized linear models with a flat inverse link function and moderate coefficients of continuous predictors. For general cases, we recommend the proposed score-matching representatives (SMR), which may improve the accuracy of estimators significantly by matching the score function values. In an illustrative application to the airline on-time performance data, we show that the MR and SMR estimates are as good as the full data estimate when the latter is available.

Highlights

  • In the past decade, big data or massive data has drawn dramatically increasing attention all over the world

  • To fit a generalized linear model (GLM) with p predictors, a typical algorithm searching for the maximum likelihood estimate (MLE) based on the full data of size N requires O(ζ_N N p^2) time to run, where ζ_N is the number of iterations required for the convergence of the full data MLE algorithm (Wang, Zhu and Ma, 2018)

  • We show that the mean representatives (MR) and score-matching representatives (SMR) estimates are as accurate as the full data estimate when the latter is available

Summary

Introduction

Big data or massive data has drawn dramatically increasing attention all over the world. To avoid intensive data communication between nodes, and even to avoid any raw data transfer, we propose a different data analysis strategy for distributed massive datasets with data localization requirements, named the representative approach. By exchanging only the estimated parameters and the representative data points among parallel computers, the representative approach can work well even with slow or restricted network connections. It also fulfills user privacy or security requirements, since analysts perform regression analysis on the representatives without direct access to the raw data. Different from the data binning technique, the representative approach proposed in this paper assumes a given data partition and concentrates on constructing the best smoothing values, which we call representatives, more efficiently for a pre-specified regression model.

[Table entries; row and column labels not recovered: 1; φ(η){Φ(η)Φ(−η)}^{−1}; exp(η){1 − exp[−exp(η)]}^{−1}; exp(−η){1 − exp[−exp(−η)]}^{−1}; π{(1 + η^2)(π^2/4 − arctan^2(η))}^{−1}]
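The representative idea described in the introduction can be illustrated with a minimal sketch for the linear model case with mean representatives. Everything below is an assumption made for illustration and is not taken from the paper: the function names `mean_representatives` and `weighted_least_squares`, the grid-based partition, the simulated data, and in particular the choice to weight each representative by its block size.

```python
import numpy as np

def mean_representatives(X, y, block_ids):
    """Return per-block means of X and y, plus block sizes (hypothetical helper)."""
    _, inverse, counts = np.unique(block_ids, return_inverse=True, return_counts=True)
    k = len(counts)
    X_sum = np.zeros((k, X.shape[1]))
    y_sum = np.zeros(k)
    np.add.at(X_sum, inverse, X)   # accumulate predictor rows block by block
    np.add.at(y_sum, inverse, y)   # accumulate responses block by block
    return X_sum / counts[:, None], y_sum / counts, counts

def weighted_least_squares(X_rep, y_rep, weights):
    """Fit intercept + slopes on the representatives (assumed weights: block sizes)."""
    A = np.column_stack([np.ones(len(y_rep)), X_rep])
    sw = np.sqrt(weights)
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y_rep * sw, rcond=None)
    return beta

# Simulated linear-model data (illustrative only).
rng = np.random.default_rng(0)
N = 20000
true_beta = np.array([1.0, 2.0, -0.5])          # intercept, slope 1, slope 2
X = rng.uniform(-1.0, 1.0, size=(N, 2))
y = true_beta[0] + X @ true_beta[1:] + rng.normal(size=N)

# Assumed partition: a 10 x 10 grid on the predictors, so each block
# contains points with similar predictor values.
edges = np.linspace(-1.0, 1.0, 11)[1:-1]
block_ids = np.digitize(X[:, 0], edges) * 10 + np.digitize(X[:, 1], edges)

X_rep, y_rep, counts = mean_representatives(X, y, block_ids)
beta_hat = weighted_least_squares(X_rep, y_rep, counts)
print("number of representatives:", len(counts))
print("estimate from representatives:", beta_hat)
```

Only the block means and block sizes enter the fit, so in a distributed setting each node would need to share just those summaries rather than raw rows. The sketch does not cover score-matching representatives, which require the model's score function.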

Generalized linear model and score function
Representative approaches and mean representatives
Score-matching representative approach for GLMs
Score-matching representative approach
Simulation studies with logistic regression model
Other GLM examples
Theoretical justification of SMR
Asymptotic properties of MR and SMR for big data
CPU time
A case study: airline on-time performance data
Discussion and conclusion
SMR and MR for linear model
Practical number of iterations for SMR
SMR vs divide-and-conquer for logistic models
SMR vs support points for logistic models
MR and SMR with finer partition
CPU time of MR and SMR
Subset clustering strategy
Findings
More proofs
More on airline on-time performance data
