Abstract
We propose a fast and efficient strategy, called the representative approach, for big data analysis with generalized linear models, especially for distributed data with localization requirements or limited network bandwidth. Given a partition of a massive dataset, this approach constructs a representative data point for each data block and fits the target model using the representative dataset. In terms of time complexity, it is as fast as the subsampling approaches in the literature. In terms of efficiency, its accuracy in estimating parameters under a homogeneous partition is comparable with that of the divide-and-conquer method. Supported by comprehensive simulation studies and theoretical justifications, we conclude that mean representatives (MR) work well for linear models, or for generalized linear models with a flat inverse link function and moderate coefficients of continuous predictors. For general cases, we recommend the proposed score-matching representatives (SMR), which may improve the accuracy of estimators significantly by matching the score function values. In an illustrative application to the Airline on-time performance data, we show that the MR and SMR estimates are as good as the full data estimate when the latter is available.
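The mean-representative idea for a linear model can be illustrated with a minimal sketch. It is not the authors' implementation: the quantile-grid partition, the use of block sizes as regression weights, and the function names `grid_partition`, `mean_representatives`, and `fit_weighted_linear` are all assumptions made for illustration.

```python
import numpy as np

def grid_partition(X, bins_per_dim=5):
    """Hypothetical partition: quantile bins per predictor, combined into one block id."""
    ids = np.zeros(len(X), dtype=int)
    for j in range(X.shape[1]):
        edges = np.quantile(X[:, j], np.linspace(0, 1, bins_per_dim + 1)[1:-1])
        ids = ids * bins_per_dim + np.digitize(X[:, j], edges)
    return ids

def mean_representatives(X, y, block_ids):
    """One representative per block: the block means of X and y, plus the block size."""
    blocks = np.unique(block_ids)
    X_rep = np.array([X[block_ids == b].mean(axis=0) for b in blocks])
    y_rep = np.array([y[block_ids == b].mean() for b in blocks])
    n_rep = np.array([(block_ids == b).sum() for b in blocks])
    return X_rep, y_rep, n_rep

def fit_weighted_linear(X_rep, y_rep, weights):
    """Weighted least squares on the representatives (intercept added here)."""
    A = np.column_stack([np.ones(len(y_rep)), X_rep])
    sw = np.sqrt(weights)
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y_rep * sw, rcond=None)
    return beta

# Simulated data (illustrative only, not from the paper).
rng = np.random.default_rng(0)
N = 10_000
beta_true = np.array([1.0, 2.0, -3.0])
X = rng.normal(size=(N, 2))
y = beta_true[0] + X @ beta_true[1:] + rng.normal(size=N)

block_ids = grid_partition(X)                      # at most 25 blocks
X_rep, y_rep, n_rep = mean_representatives(X, y, block_ids)
beta_mr = fit_weighted_linear(X_rep, y_rep, n_rep)  # fitted on representatives only
```

Only the representatives and their block sizes reach the fitting step, which is what makes the approach attractive when raw data cannot leave a node. For a linear model, block means preserve the linear relationship exactly, which is consistent with the claim that MR is adequate in that case.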
Highlights
In the past decade, big data or massive data has drawn dramatically increasing attention all over the world
To fit a generalized linear model (GLM) with $p$ predictors, a typical algorithm that searches for the maximum likelihood estimate (MLE) based on the full data of size $N$ requires $O(\zeta_N N p^2)$ time, where $\zeta_N$ is the number of iterations required for the full data MLE algorithm to converge (Wang, Zhu and Ma, 2018)
We show that the mean representatives (MR) and score-matching representative (SMR) estimates are as accurate as the full data estimate when available
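The time complexity quoted in the highlights above can be made concrete with a minimal sketch of a Newton (iteratively reweighted least squares) fit for logistic regression. This is a generic textbook routine, not the specific algorithm analyzed in the cited reference; the function name `logistic_mle_irls` and the simulated data are assumptions. Each iteration is dominated by forming the $p \times p$ information matrix, which costs $O(N p^2)$, and the returned iteration count plays the role of $\zeta_N$.

```python
import numpy as np

def logistic_mle_irls(X, y, max_iter=50, tol=1e-8):
    """Newton/IRLS for logistic regression; returns (beta, iterations used)."""
    N, p = X.shape
    beta = np.zeros(p)
    for it in range(1, max_iter + 1):
        eta = X @ beta                        # O(N p)
        mu = 1.0 / (1.0 + np.exp(-eta))       # inverse logit
        w = mu * (1.0 - mu)                   # working weights
        score = X.T @ (y - mu)                # O(N p)
        info = X.T @ (X * w[:, None])         # O(N p^2): the dominant cost
        step = np.linalg.solve(info, score)   # O(p^3)
        beta = beta + step
        if np.linalg.norm(step) < tol:
            return beta, it
    return beta, max_iter

# Simulated data (illustrative only).
rng = np.random.default_rng(1)
N = 2000
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta_true = np.array([0.5, 1.0, -1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

beta_hat, n_iter = logistic_mle_irls(X, y)
# At the MLE the score equations X^T (y - mu) = 0 hold.
final_score = X.T @ (y - 1.0 / (1.0 + np.exp(-X @ beta_hat)))
```

Replacing the $N$ raw rows with a much smaller set of representatives shrinks the `info` computation accordingly, which is where the speed-up of a representative-based fit would come from.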
Summary
Big data, or massive data, has drawn dramatically increasing attention all over the world. To avoid intensive data communication between nodes, and even to avoid any raw data transfer, we propose a different data analysis strategy, named the representative approach, for distributed massive datasets with data localization requirements. Because only the estimated parameters and the representative data points are exchanged among the parallel computing nodes, the representative approach works well even over slow or restricted network connections. It also fulfills user privacy or security requirements, since analysts perform regression analysis on the representatives without direct access to the raw data. Unlike the data binning technique, the representative approach proposed in this paper assumes a given data partition and concentrates on constructing the best smoothing values, which we call representatives, more efficiently for a pre-specified regression model.
Expressions in $\eta$ extracted from a table in the original (row labels not recovered): $1$; $\phi(\eta)\{\Phi(\eta)\Phi(-\eta)\}^{-1}$; $\exp(\eta)\{1-\exp[-\exp(\eta)]\}^{-1}$; $\exp(-\eta)\{1-\exp[-\exp(-\eta)]\}^{-1}$; $\pi\{(1+\eta^2)(\pi^2/4-\arctan^2(\eta))\}^{-1}$.
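The five expressions in $\eta$ at the end of this summary appear without labels. One reading, which is an assumption and not stated in the text, is that they are the score weights $(d\mu/d\eta)/\{\mu(1-\mu)\}$ of a binary-response GLM under the logit, probit, complementary log-log, log-log, and cauchit links, in that order. The sketch below checks that reading numerically; the dictionary names and the helper `score_weight` are hypothetical.

```python
import math

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Inverse link functions mu = g^{-1}(eta) for the assumed links.
INVERSE_LINKS = {
    "logit": lambda e: 1.0 / (1.0 + math.exp(-e)),
    "probit": norm_cdf,
    "cloglog": lambda e: 1.0 - math.exp(-math.exp(e)),
    "loglog": lambda e: math.exp(-math.exp(-e)),
    "cauchit": lambda e: 0.5 + math.atan(e) / math.pi,
}

# The five expressions from the summary, in their original order.
EXPRESSIONS = {
    "logit": lambda e: 1.0,
    "probit": lambda e: norm_pdf(e) / (norm_cdf(e) * norm_cdf(-e)),
    "cloglog": lambda e: math.exp(e) / (1.0 - math.exp(-math.exp(e))),
    "loglog": lambda e: math.exp(-e) / (1.0 - math.exp(-math.exp(-e))),
    "cauchit": lambda e: math.pi
    / ((1.0 + e * e) * (math.pi ** 2 / 4.0 - math.atan(e) ** 2)),
}

def score_weight(inv_link, eta, h=1e-6):
    """(d mu / d eta) / (mu (1 - mu)), with a central-difference derivative."""
    mu = inv_link(eta)
    dmu = (inv_link(eta + h) - inv_link(eta - h)) / (2.0 * h)
    return dmu / (mu * (1.0 - mu))

etas = [-1.5, -0.5, 0.0, 0.5, 1.5]
max_rel_error = {
    name: max(
        abs(EXPRESSIONS[name](e) - score_weight(INVERSE_LINKS[name], e))
        / abs(EXPRESSIONS[name](e))
        for e in etas
    )
    for name in EXPRESSIONS
}
```

If the assumed pairing is right, every entry of `max_rel_error` is at the level of finite-difference error. The logit entry equals 1 because the logit link is canonical for binary responses.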