Abstract
Merge & Reduce is a general algorithmic scheme in the theory of data structures. Its main purpose is to transform static data structures—that support only queries—into dynamic data structures—that allow insertions of new elements—with as little overhead as possible. This can be used to turn classic offline algorithms for summarizing and analyzing data into streaming algorithms. We transfer these ideas to the setting of statistical data analysis in streaming environments. Our approach is conceptually different from previous settings where Merge & Reduce has been employed. Instead of summarizing the data, we combine the Merge & Reduce framework directly with statistical models. This enables performing computationally demanding data analysis tasks on massive data sets. The computations are divided into small tractable batches whose size is independent of the total number of observations n. The results are combined in a structured way at the cost of a bounded O(log n) factor in their memory requirements. It is only necessary, though nontrivial, to choose an appropriate statistical model and design merge and reduce operations on a casewise basis for the specific type of model. We illustrate our Merge & Reduce schemes on simulated and real-world data employing (Bayesian) linear regression models, Gaussian mixture models and generalized linear models.
Highlights
In recent times, data sets with a massive number of observations have become increasingly common, making scalability one of the main challenges of modern data analysis
We propose Merge & Reduce as a technique to address these scalability limitations in regression analysis
This can be leveraged to design streaming algorithms for a computational problem based on coresets
Summary
Data sets with a massive number of observations have become increasingly common, making scalability one of the main challenges of modern data analysis. Merge & Reduce is well known in computer science and has mainly been used for transforming static data structures into dynamic data structures with little overhead [8]. This can be leveraged to design streaming algorithms for a computational problem based on coresets. For some statistical problems, however, it is known that small coresets do not exist in the worst case; this is true, e.g., for specific generalized linear models, see the lower bounds in [37,39]. Our approach instead uses statistical models computed on small batches as concise summaries. Combining these statistical models via the Merge & Reduce framework, we can again turn an offline algorithm into a data stream algorithm. The algorithms should be amenable to distributed computing environments like MapReduce [16].
International Journal of Data Science and Analytics (2020) 10:331–347
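The scheme described above can be sketched in code. The following is a minimal illustration, not the authors' implementation: it uses ordinary least squares as the statistical model, where the batch "summary" is the pair of sufficient statistics (X'X, X'y), the merge operation is exact addition of these statistics, and the Merge & Reduce bucket structure keeps at most one summary per level, so only O(log n) summaries are held in memory at any time. The function names (`fit_summary`, `merge`, `merge_and_reduce`) are hypothetical.

```python
import numpy as np

def fit_summary(X, y):
    # Summary of one batch: OLS sufficient statistics X'X and X'y.
    return X.T @ X, X.T @ y

def merge(s1, s2):
    # Merge two summaries; for OLS sufficient statistics this is exact.
    return s1[0] + s2[0], s1[1] + s2[1]

def merge_and_reduce(stream):
    """Consume a stream of (X, y) batches, keeping one summary per level.

    Two summaries on the same level are merged and promoted one level up,
    so at most O(log n) summaries are stored for n observations.
    """
    levels = {}  # level -> summary
    for X, y in stream:
        s, lvl = fit_summary(X, y), 0
        while lvl in levels:           # a summary already sits on this level:
            s = merge(levels.pop(lvl), s)  # merge and move up one level
            lvl += 1
        levels[lvl] = s
    # Final pass: combine the remaining per-level summaries.
    total = None
    for s in levels.values():
        total = s if total is None else merge(total, s)
    XtX, Xty = total
    return np.linalg.solve(XtX, Xty)   # OLS coefficient estimate

# Example: 400 observations processed in batches of 50.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=400)
batches = ((X[i:i + 50], y[i:i + 50]) for i in range(0, 400, 50))
coef = merge_and_reduce(batches)
```

Because the merge is exact for OLS sufficient statistics, the streamed estimate coincides with the offline fit on the full data; for models without exact merges (e.g., Gaussian mixtures), the merge and reduce operations must be designed per model, as the summary notes.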