Score-matching representative approach for big data analysis with generalized linear models

Keren Li, Jie Yang

doi:10.1214/21-ejs1965

Open Access

https://doi.org/10.1214/21-ejs1965

Abstract

We propose a fast and efficient strategy, called the representative approach, for big data analysis with generalized linear models, especially for distributed data with localization requirements or limited network bandwidth. Given a partition of a massive dataset, this approach constructs a representative data point for each data block and fits the target model using the representative dataset. In terms of time complexity, it is as fast as the subsampling approaches in the literature. In terms of efficiency, its accuracy in estimating parameters under a homogeneous partition is comparable to that of the divide-and-conquer method. Supported by comprehensive simulation studies and theoretical justifications, we conclude that mean representatives (MR) work well for linear models or generalized linear models with a flat inverse link function and moderate coefficients of continuous predictors. For general cases, we recommend the proposed score-matching representatives (SMR), which may improve the accuracy of estimators significantly by matching the score function values. In an illustrative application to the Airline on-time performance data, we show that the MR and SMR estimates are as good as the full data estimate, when the latter is available.
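The mean representative idea in the abstract can be illustrated for the simplest case, a linear model. The following is a minimal sketch, not the authors' implementation: it assumes that each block is summarized by the mean of its predictors and responses, and that the model is fitted to those means by least squares weighted by block size. The function names, the simulated data, and the grid partition are all hypothetical.

```python
import numpy as np

def mean_representatives(X, y, block_ids):
    """Summarize each block by the mean of its rows; also return block sizes."""
    block_ids = np.asarray(block_ids).ravel()
    _, inverse, counts = np.unique(block_ids, return_inverse=True, return_counts=True)
    X_rep = np.zeros((len(counts), X.shape[1]))
    y_rep = np.zeros(len(counts))
    np.add.at(X_rep, inverse, X)   # per-block sums of predictors
    np.add.at(y_rep, inverse, y)   # per-block sums of responses
    return X_rep / counts[:, None], y_rep / counts, counts

def weighted_least_squares(X, y, w):
    """Least squares with an intercept and observation weights w."""
    A = np.column_stack([np.ones(len(y)), X])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return beta

# Simulated data (assumption): y = 1 + 2*x1 - 3*x2 + noise.
rng = np.random.default_rng(0)
N = 20000
X = rng.uniform(-1.0, 1.0, size=(N, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0.0, 0.1, size=N)

# Hypothetical partition: a 10 x 10 grid over the predictor space.
cell = np.clip(np.floor((X + 1.0) * 5.0).astype(int), 0, 9)
block_ids = cell[:, 0] * 10 + cell[:, 1]

X_rep, y_rep, counts = mean_representatives(X, y, block_ids)
beta_mr = weighted_least_squares(X_rep, y_rep, counts)   # fit on 100 representatives
beta_full = weighted_least_squares(X, y, np.ones(N))     # fit on all N rows
```

Under these assumptions the second fit uses only one row per block, so only block means and block sizes would need to leave the node that holds the raw data.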

Highlights

In the past decade, big data, or massive data, has drawn dramatically increasing attention all over the world.
To fit a generalized linear model (GLM) with $p$ predictors, a typical algorithm searching for the maximum likelihood estimate (MLE) based on the full data of size $N$ requires $O(\zeta_N N p^2)$ time, where $\zeta_N$ is the number of iterations required for the full data MLE algorithm to converge (Wang, Zhu and Ma, 2018).
We show that the mean representative (MR) and score-matching representative (SMR) estimates are as accurate as the full data estimate, when the latter is available.
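To make the full data cost in the second highlight concrete, here is a minimal sketch of one common MLE algorithm, Newton's method (iteratively reweighted least squares) for logistic regression. It is an illustration under stated assumptions and not necessarily the algorithm analyzed in the cited work; the function name and the simulated data are hypothetical. Each iteration forms the information matrix at a cost of order N p^2, so the total cost is the per-iteration cost times the number of iterations.

```python
import numpy as np

def logistic_mle_irls(X, y, max_iter=50, tol=1e-10):
    """Newton/IRLS for logistic regression; returns (estimate, iterations used)."""
    N, p = X.shape
    beta = np.zeros(p)
    for n_iter in range(1, max_iter + 1):
        eta = X @ beta                        # O(N p)
        mu = 1.0 / (1.0 + np.exp(-eta))
        w = mu * (1.0 - mu)
        score = X.T @ (y - mu)                # O(N p)
        info = X.T @ (X * w[:, None])         # O(N p^2): dominant per-iteration cost
        step = np.linalg.solve(info, score)   # O(p^3), small when p << N
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta, n_iter

# Simulated data (assumption): intercept plus two standard normal predictors.
rng = np.random.default_rng(1)
N = 5000
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
beta_true = np.array([0.5, -1.0, 1.0])
prob = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
y = (rng.uniform(size=N) < prob).astype(float)

beta_hat, n_iter = logistic_mle_irls(X, y)
# At the MLE the score vector should be numerically zero.
score_at_hat = X.T @ (y - 1.0 / (1.0 + np.exp(-(X @ beta_hat))))
```

Fitting the same model to K representative points instead of N raw rows would replace N by K in the dominant term, which is the source of the speed-up the abstract describes.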

Summary

Introduction

Big data, or massive data, has drawn dramatically increasing attention all over the world. To avoid intensive data communication between nodes, and even to avoid any raw data transfer, we propose a different data analysis strategy for distributed massive datasets with data localization requirements, named the representative approach. By exchanging only the estimated parameters and the representative data points among parallel computing nodes, the representative approach can work well even with slow or restricted network connections. It also fulfills user privacy or security requirements, since analysts perform regression analysis on the representatives without direct access to the raw data. Unlike the data binning technique, the representative approach proposed in this paper assumes a given data partition and concentrates on constructing the best smoothing values, which we call representatives, more efficiently for a pre-specified regression model.

Table entries recovered from the source (functions of $\eta$): $1$; $\varphi(\eta)\{\Phi(\eta)\Phi(-\eta)\}^{-1}$; $\exp(\eta)\{1-\exp[-\exp(\eta)]\}^{-1}$; $\exp(-\eta)\{1-\exp[-\exp(-\eta)]\}^{-1}$; $\pi\{(1+\eta^2)(\pi^2/4-\arctan^2(\eta))\}^{-1}$.
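The five expressions in η that close the paragraph above appear without row labels on this page. As a hedged reading, they coincide with the factor (dμ/dη) / {μ(1 − μ)} that multiplies (y − μ)x in the score function of a binary-response GLM, evaluated for the logit, probit, complementary log-log, log-log, and cauchit links in that order. Matching the expressions to these link names is an inference, not something the page states. The sketch below checks the match numerically; all names in it are hypothetical.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_pdf(e):
    return np.exp(-0.5 * e ** 2) / np.sqrt(2.0 * np.pi)

def norm_cdf(e):
    return 0.5 * (1.0 + _erf(e / np.sqrt(2.0)))

# name: (inverse link mu(eta), derivative dmu/deta, closed form from the text)
LINKS = {
    "logit": (
        lambda e: 1.0 / (1.0 + np.exp(-e)),
        lambda e: np.exp(-e) / (1.0 + np.exp(-e)) ** 2,
        lambda e: np.ones_like(e),
    ),
    "probit": (
        norm_cdf,
        norm_pdf,
        lambda e: norm_pdf(e) / (norm_cdf(e) * norm_cdf(-e)),
    ),
    "cloglog": (
        lambda e: 1.0 - np.exp(-np.exp(e)),
        lambda e: np.exp(e - np.exp(e)),
        lambda e: np.exp(e) / (1.0 - np.exp(-np.exp(e))),
    ),
    "loglog": (
        lambda e: np.exp(-np.exp(-e)),
        lambda e: np.exp(-e - np.exp(-e)),
        lambda e: np.exp(-e) / (1.0 - np.exp(-np.exp(-e))),
    ),
    "cauchit": (
        lambda e: 0.5 + np.arctan(e) / np.pi,
        lambda e: 1.0 / (np.pi * (1.0 + e ** 2)),
        lambda e: np.pi / ((1.0 + e ** 2) * (np.pi ** 2 / 4.0 - np.arctan(e) ** 2)),
    ),
}

eta = np.linspace(-2.0, 2.0, 41)
max_rel_err = {}
for name, (mu, dmu, closed) in LINKS.items():
    m = mu(eta)
    direct = dmu(eta) / (m * (1.0 - m))   # score factor computed from its definition
    target = closed(eta)                  # expression as printed in the text
    max_rel_err[name] = float(np.max(np.abs(direct - target) / np.abs(target)))
```

The dictionary `max_rel_err` holds, for each link, the largest relative gap between the definition and the printed expression over the grid of η values.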

Generalized linear model and score function

Representative approaches and mean representatives

Score-matching representative approach for GLMs

Score-matching representative approach

Simulation studies with logistic regression model

Other GLM examples

Theoretical justification of SMR

Asymptotic properties of MR and SMR for big data

CPU time

A case study: airline on-time performance data

Discussion and conclusion

SMR and MR for linear model

Practical number of iterations for SMR

SMR vs divide-and-conquer for logistic models

SMR vs support points for logistic models

MR and SMR with finer partition

CPU time of MR and SMR

Subset clustering strategy

Findings

More proofs

More on airline on-time performance data

Journal: Electronic Journal of Statistics
Publication Date: Jan 1, 2022
Citations: 5
License type: CC BY
