Abstract

Sketching is a probabilistic data compression technique that has been largely developed by the computer science community. Numerical operations on big datasets can be intolerably slow; sketching algorithms address this issue by generating a smaller surrogate dataset. Typically, inference proceeds on the compressed dataset. Sketching algorithms generally use random projections to compress the original dataset, and this stochastic generation process makes them amenable to statistical analysis. We argue that the sketched data can be modelled as a random sample, thus placing this family of data compression methods firmly within an inferential framework. In particular, we focus on the Gaussian, Hadamard and Clarkson–Woodruff sketches and their use in single-pass sketching algorithms for linear regression with huge samples. We explore the statistical properties of sketched regression algorithms and derive new distributional results for a large class of sketching estimators. A key result is a conditional central limit theorem for data-oblivious sketches. An important finding is that the best choice of sketching algorithm in terms of mean squared error is related to the signal-to-noise ratio in the source dataset. Finally, we demonstrate the theory and the limits of its applicability on two datasets.
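
For concreteness, the three data-oblivious sketches named above admit short implementations. The following Python functions are an illustrative sketch of our own (function names and details are ours, not code from the paper), each compressing an n-row matrix A to k << n rows:

```python
# Minimal illustrations of three data-oblivious sketches, each mapping
# an (n, d) matrix A to a (k, d) surrogate SA with k << n.
import numpy as np

rng = np.random.default_rng(0)

def gaussian_sketch(A, k):
    """Gaussian sketch: S has i.i.d. N(0, 1/k) entries, so E[S'S] = I."""
    n = A.shape[0]
    S = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, n))
    return S @ A

def hadamard_sketch(A, k):
    """Subsampled randomised Hadamard transform (n must be a power of two)."""
    from scipy.linalg import hadamard
    n = A.shape[0]
    H = hadamard(n) / np.sqrt(n)           # orthonormal Hadamard matrix
    D = rng.choice([-1.0, 1.0], size=n)    # random sign flips, i.e. H @ diag(D)
    rows = rng.choice(n, size=k, replace=True)
    return np.sqrt(n / k) * (H * D)[rows] @ A

def clarkson_woodruff_sketch(A, k):
    """Clarkson-Woodruff (CountSketch): each row of A is hashed to one of
    k buckets and added with a random sign; computable in a single pass."""
    n = A.shape[0]
    SA = np.zeros((k, A.shape[1]))
    buckets = rng.integers(0, k, size=n)
    signs = rng.choice([-1.0, 1.0], size=n)
    for i in range(n):
        SA[buckets[i]] += signs[i] * A[i]
    return SA
```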

Highlights

  • Sketching is a general probabilistic data compression technique designed for Big Data applications (Cormode, 2011)

  • In this paper we investigate the statistical properties of sketched regression algorithms, a perspective which has received little attention up to now

  • Our work differs in that we study data-oblivious random projections and build our asymptotic results from a conditional central limit theorem for the sketched data matrix (a small simulation illustrating this is sketched below)
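
The conditional central limit theorem in the last highlight can be checked empirically. Below is a minimal simulation, our own illustration rather than code from the paper: with the source data held fixed, complete-sketching least-squares estimates under repeated Gaussian sketches should concentrate around the full-data ordinary least squares solution, approximately normally for a large sketch size k.

```python
# Illustrative simulation (assumed setup, not the paper's code): conditional
# on fixed (X, y), the sketched OLS estimator fluctuates around the
# full-data OLS fit beta_F across repeated Gaussian sketches.
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 2000, 3, 200
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)
beta_F = np.linalg.lstsq(X, y, rcond=None)[0]   # full-data OLS fit

estimates = []
for _ in range(200):
    S = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, n))  # fresh Gaussian sketch
    estimates.append(np.linalg.lstsq(S @ X, S @ y, rcond=None)[0])
estimates = np.asarray(estimates)

print("full-data OLS:          ", beta_F)
print("mean sketched estimate: ", estimates.mean(axis=0))
print("sd of sketched estimates:", estimates.std(axis=0))
```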


Introduction

Sketching is a general probabilistic data compression technique designed for Big Data applications (Cormode, 2011). Most existing results on the accuracy of sketching are universal worst-case bounds (Woodruff, 2014; Mahoney and Drineas, 2016). Although such bounds are typical for randomised algorithms, a more detailed error analysis can provide important insights (Halko et al., 2011). The first stage in our analysis is to establish the distributional properties of the sketched estimators with the source dataset held fixed; this gives a clear statistical picture of the behaviour of the randomised algorithm. Given a statistical model for the response, y = Xβ0 + ε, with a vector of population parameters β0 and error terms ε, we can then determine properties of the partial and complete sketching estimators βP and βS by integrating over the conditional distributions of the sketched estimators that take y as fixed.
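
To make the two estimators concrete: under our reading, the complete sketching estimator βS solves least squares on the compressed pair (SX, Sy), while the partial sketching estimator βP compresses only the Gram matrix X'X and retains the exact cross-product X'y. A minimal Python sketch, assuming a Gaussian sketch and using our own function names:

```python
# Illustrative complete vs. partial sketching for OLS with a Gaussian
# sketch S of size (k, n); function names are ours, not the paper's.
import numpy as np

def complete_sketch_ols(X, y, k, rng):
    """beta_S: solve least squares on the sketched pair (SX, Sy)."""
    S = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, X.shape[0]))
    return np.linalg.lstsq(S @ X, S @ y, rcond=None)[0]

def partial_sketch_ols(X, y, k, rng):
    """beta_P: sketch only the Gram matrix X'X, keep the exact X'y."""
    S = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, X.shape[0]))
    SX = S @ X
    return np.linalg.solve(SX.T @ SX, X.T @ y)
```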

Preliminaries
Worst case bounds
Data oblivious sketches
Data aware sketches
Complete sketching
Partial sketching
Relative efficiency
Combined estimator
One-step correction
Sketching central limit theorem
Sketching estimators
Unconditional results
Human leukocyte antigen dataset
Flights dataset
Synthetic data
Discussion
A Sketching examples
B Proof of Theorem 1
D Variance for partial sketching
E Combined estimator results
Clarkson–Woodruff sketch