Abstract

We propose a strategy for computing estimators in some non-standard M-estimation problems, where the data are distributed across different servers and the observations across servers, though independent, can come from heterogeneous sub-populations, thereby violating the identically distributed assumption. Our strategy fixes the super-efficiency phenomenon observed in prior work on distributed computing in (i) the isotonic regression framework, where averaging several isotonic estimates (each computed at a local server) on a central server produces super-efficient estimates that do not replicate the properties of the global isotonic estimator, i.e. the isotonic estimate that would be constructed by transferring all the data to a single server, and (ii) certain types of M-estimation problems involving optimization of discontinuous criterion functions where M-estimates converge at the cube-root rate. The new estimators proposed in this paper work by smoothing the data on each local server, communicating the smoothed summaries to the central server, and then solving a non-linear optimization problem at the central server. They are shown to replicate the asymptotic properties of the corresponding global estimators, and also overcome the super-efficiency phenomenon exhibited by existing estimators.
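
The contrast at the heart of (i) can be made concrete with a short simulation. The sketch below is illustrative only: the data-generating model, the number of servers, and the evaluation grid are our own assumptions rather than the paper's, and scikit-learn's IsotonicRegression stands in for the local and global isotonic fits. It computes the pooled-by-averaging estimator (an isotonic fit on each server, the fits averaged centrally) alongside the global isotonic estimator obtained by pooling all the data.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Simulated data (illustrative): monotone signal mu(x) = x^2 on [0, 1] plus noise.
N, L = 10_000, 20                      # total sample size and number of servers (assumed)
x = rng.uniform(0.0, 1.0, size=N)
y = x**2 + rng.normal(scale=0.3, size=N)

# Scatter the N pairs across the L servers at random, mimicking "scrambled" storage.
server_of = rng.integers(0, L, size=N)

grid = np.linspace(0.05, 0.95, 19)     # points at which the regression function is estimated

# Pooled-by-averaging estimator: an isotonic fit on each local server,
# then the L local fits are averaged at the central server.
local_fits = []
for server in range(L):
    idx = server_of == server
    iso = IsotonicRegression(out_of_bounds="clip").fit(x[idx], y[idx])
    local_fits.append(iso.predict(grid))
pooled_by_averaging = np.mean(local_fits, axis=0)

# Global estimator: the isotonic fit obtained by transferring all the data to one server.
global_fit = IsotonicRegression(out_of_bounds="clip").fit(x, y).predict(grid)

print("pooled-by-averaging:", np.round(pooled_by_averaging[:5], 3))
print("global isotonic fit:", np.round(global_fit[:5], 3))
```

The averaging step in this pipeline is exactly the one that, as described above, produces super-efficient estimates that do not replicate the properties of the global isotonic estimator.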

Highlights

  • Distributed computing has become significant in the practice of statistics as well as in other branches of data science

  • As the literature on distributed computing is enormous, here we provide a selection of instances of research on distributed computing problems in a variety of statistical/machine-learning contexts: see, e.g., [10], [12], [26], [27], [6], [19], [24]

  • Our goal in this paper is to propose new estimators under the divide-and-conquer (DC) framework, both for the monotone function estimation problem and for certain versions of the M-estimation setting of [20], that do not suffer from the super-efficiency problem of the pooled-by-averaging estimators and that recover the limiting properties of the corresponding global estimators

Summary

Background

Distributed computing has become significant in the practice of statistics as well as in other branches of data science. BDS and [20] demonstrate, in both problems, that the maximal MSE of the pooled-by-averaging estimator over a collection of models in a neighborhood of a fixed model diverges to ∞ with N, while the maximal MSE of the global estimator remains bounded. In both BDS and [20], super-efficiency results from computing the non-standard estimator at each local machine and averaging these estimators at the central server. To avoid this undesirable phenomenon, the key idea is to reverse these steps: first average the data on each local server in an appropriate manner (which will typically depend on the structure and the dimension of the problem) to obtain essentially sufficient summary statistics, which are then transferred to the central server. The N pairs will be scrambled across a number of different servers (say L), with the same server hosting data from different sub-populations, as well as data from the same sub-population potentially stored on multiple servers.
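
A minimal sketch of this reversed pipeline, under the same assumed simulation setting as above, is given below. Here the local "averaging of the data" is done by simple binning, with each server transmitting only per-bin sums and counts to the central server, where the monotone estimate is then computed by weighted isotonic (monotone least-squares) regression on the combined summaries. The binning smoother, the number of bins, and the use of scikit-learn's IsotonicRegression are our own illustrative choices, not the paper's exact construction.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)

# Same assumed simulation setting: N pairs scrambled across L servers.
N, L = 10_000, 20
x = rng.uniform(0.0, 1.0, size=N)
y = x**2 + rng.normal(scale=0.3, size=N)
server_of = rng.integers(0, L, size=N)

# Step 1 (local servers): smooth the data by averaging responses within K bins.
# Each server transmits only per-bin sums and counts -- small summaries, not raw data.
K = 50
edges = np.linspace(0.0, 1.0, K + 1)

def local_summary(xs, ys):
    bins = np.clip(np.digitize(xs, edges) - 1, 0, K - 1)
    return np.bincount(bins, weights=ys, minlength=K), np.bincount(bins, minlength=K)

summaries = [local_summary(x[server_of == s], y[server_of == s]) for s in range(L)]

# Step 2 (central server): combine the summaries into smoothed pseudo-observations ...
total_sums = sum(s for s, _ in summaries)
total_counts = sum(c for _, c in summaries)
centers = 0.5 * (edges[:-1] + edges[1:])
mask = total_counts > 0
smoothed = total_sums[mask] / total_counts[mask]

# ... and solve the monotone least-squares problem on those summaries
# (isotonic regression is the relevant constrained optimization in the monotone case).
central_fit = IsotonicRegression(out_of_bounds="clip").fit(
    centers[mask], smoothed, sample_weight=total_counts[mask]
)

grid = np.linspace(0.05, 0.95, 19)
print("smooth-then-isotonize fit:", np.round(central_fit.predict(grid), 3))
```

In this sketch the communication per server is O(K) numbers rather than O(N/L) raw observations, and the constrained optimization is performed only once, at the central server, on the smoothed summaries.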

The new estimator for the regression function
Computational considerations
Characterization of the new estimators
Notation and assumptions
The regression function μ satisfies
Uniformly bounded MSE property of the new estimators
Asymptotic distributions
The location parameter problem
Theoretical properties of the pooled estimator
Discussion
Preparatory lemmas
Limited simulation results