Abstract

Sketching is a probabilistic data compression technique that has been largely developed by the computer science community. Numerical operations on big datasets can be intolerably slow; sketching algorithms address this issue by generating a smaller surrogate dataset. Typically, inference proceeds on the compressed dataset. Sketching algorithms generally use random projections to compress the original dataset, and this stochastic generation process makes them amenable to statistical analysis. We argue that the sketched data can be modelled as a random sample, thus placing this family of data compression methods firmly within an inferential framework. In particular, we focus on the Gaussian, Hadamard and Clarkson–Woodruff sketches and their use in single-pass sketching algorithms for linear regression with huge samples. We explore the statistical properties of sketched regression algorithms and derive new distributional results for a large class of sketching estimators. A key result is a conditional central limit theorem for data-oblivious sketches. An important finding is that the best choice of sketching algorithm in terms of mean squared error is related to the signal-to-noise ratio in the source dataset. Finally, we demonstrate the theory and the limits of its applicability on two datasets.
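
For concreteness, the three data-oblivious sketches named above admit short implementations. The following Python functions are an illustrative sketch of our own (function names and details are ours, not code from the paper), each compressing an n-row matrix A to k << n rows:

```python
# Minimal illustrations of three data-oblivious sketches, each mapping
# an (n, d) matrix A to a (k, d) surrogate SA with k << n.
import numpy as np

rng = np.random.default_rng(0)

def gaussian_sketch(A, k):
    """Gaussian sketch: S has i.i.d. N(0, 1/k) entries, so E[S'S] = I."""
    n = A.shape[0]
    S = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, n))
    return S @ A

def hadamard_sketch(A, k):
    """Subsampled randomised Hadamard transform (n must be a power of two)."""
    from scipy.linalg import hadamard
    n = A.shape[0]
    H = hadamard(n) / np.sqrt(n)           # orthonormal Hadamard matrix
    D = rng.choice([-1.0, 1.0], size=n)    # random sign flips, i.e. H @ diag(D)
    rows = rng.choice(n, size=k, replace=True)
    return np.sqrt(n / k) * (H * D)[rows] @ A

def clarkson_woodruff_sketch(A, k):
    """Clarkson-Woodruff (CountSketch): each row of A is hashed to one of
    k buckets and added with a random sign; computable in a single pass."""
    n = A.shape[0]
    SA = np.zeros((k, A.shape[1]))
    buckets = rng.integers(0, k, size=n)
    signs = rng.choice([-1.0, 1.0], size=n)
    for i in range(n):
        SA[buckets[i]] += signs[i] * A[i]
    return SA
```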

Highlights

  • Sketching is a general probabilistic data compression technique designed for Big Data applications (Cormode, 2011)

  • In this paper we investigate the statistical properties of sketched regression algorithms, a perspective which has received little attention up to now

  • Our work differs in that we study data-oblivious random projections and build our asymptotic results from a conditional central limit theorem for the sketched data matrix (a small simulation illustrating this is sketched below)
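
The conditional central limit theorem in the last highlight can be checked empirically. Below is a minimal simulation, our own illustration rather than code from the paper: with the source data held fixed, complete-sketching least-squares estimates under repeated Gaussian sketches should concentrate around the full-data ordinary least squares solution, approximately normally for a large sketch size k.

```python
# Illustrative simulation (assumed setup, not the paper's code): conditional
# on fixed (X, y), the sketched OLS estimator fluctuates around the
# full-data OLS fit beta_F across repeated Gaussian sketches.
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 2000, 3, 200
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)
beta_F = np.linalg.lstsq(X, y, rcond=None)[0]   # full-data OLS fit

estimates = []
for _ in range(200):
    S = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, n))  # fresh Gaussian sketch
    estimates.append(np.linalg.lstsq(S @ X, S @ y, rcond=None)[0])
estimates = np.asarray(estimates)

print("full-data OLS:          ", beta_F)
print("mean sketched estimate: ", estimates.mean(axis=0))
print("sd of sketched estimates:", estimates.std(axis=0))
```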


Introduction

Sketching is a general probabilistic data compression technique designed for Big Data applications (Cormode, 2011). Most existing results on the accuracy of sketching are universal worst-case bounds (Woodruff, 2014; Mahoney and Drineas, 2016). Although such bounds are typical for randomised algorithms, a more detailed error analysis can provide important insights (Halko et al., 2011). The first stage in our analysis is to establish the distributional properties of the sketched estimators with the source dataset held fixed; this gives a clear statistical picture of the behaviour of the randomised algorithm. Given a statistical model for the response, y = Xβ0 + ε, with a vector of population parameters β0 and error terms ε, we can then determine properties of the partial and complete sketching estimators βP and βS by integrating over the conditional distributions of the sketched estimators that take y as fixed.
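
To make the two estimators concrete: under our reading, the complete sketching estimator βS solves least squares on the compressed pair (SX, Sy), while the partial sketching estimator βP compresses only the Gram matrix X'X and retains the exact cross-product X'y. A minimal Python sketch, assuming a Gaussian sketch and using our own function names:

```python
# Illustrative complete vs. partial sketching for OLS with a Gaussian
# sketch S of size (k, n); function names are ours, not the paper's.
import numpy as np

def complete_sketch_ols(X, y, k, rng):
    """beta_S: solve least squares on the sketched pair (SX, Sy)."""
    S = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, X.shape[0]))
    return np.linalg.lstsq(S @ X, S @ y, rcond=None)[0]

def partial_sketch_ols(X, y, k, rng):
    """beta_P: sketch only the Gram matrix X'X, keep the exact X'y."""
    S = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, X.shape[0]))
    SX = S @ X
    return np.linalg.solve(SX.T @ SX, X.T @ y)
```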

Preliminaries
Worst case bounds
Data oblivious sketches
Data aware sketches
Complete sketching
Partial sketching
Relative efficiency
Combined estimator
One-step correction
Sketching central limit theorem
Sketching estimators
Unconditional results
Human leukocyte antigen dataset
Flights dataset
Synthetic data
Discussion
A Sketching examples
B Proof of Theorem 1
D Variance for partial sketching
E Combined estimator results
Clarkson–Woodruff sketch