Nonparametric Mean Estimation for Big-But-Biased Data

Ricardo Cao, Laura Borrajo

doi:10.1007/978-3-319-73848-2_5

Abstract

Crawford (The hidden biases in big data, Harvard Business Review, Cambridge, 2013, [2]) has warned about the risks of the claim that "with enough data, the numbers speak for themselves." Some of the problems that arise from ignoring sampling bias in the statistical analysis of big data have recently been reported by Cao (Inferencia estadistica con datos de gran volumen, La Gaceta de la RSME 18:393–417, 2015, [1]). This work considers nonparametric statistical inference for big data in the presence of sampling bias. The mean estimation problem is studied in this setting, within a nonparametric framework, both when the biasing weight function is known (unrealistic) and when it is unknown (realistic). Two scenarios are considered to remedy the lack of knowledge of the weight function: (i) a small simple random sample from the real population is available, and (ii) a sample from a doubly biased distribution has been observed. In both cases the problem is related to nonparametric density estimation. A simulated dataset is used to illustrate the performance of the proposed nonparametric methods.
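
As an illustration of the known-weight case, the following minimal sketch assumes that the biased sample has density proportional to w(y) f(y), where f is the population density and w is the known weight function. Under that assumption, the ratio of the sample sums of Y/w(Y) and 1/w(Y) is a consistent estimator of the population mean. This is the standard inverse-weight correction and is not necessarily the estimator proposed in this work. The names `weighted_mean` and `draw_biased`, the Uniform(0, 1) population, and the weight w(x) = 0.2 + 0.8x are hypothetical choices made only for the example.

```python
import random


def weighted_mean(sample, w):
    """Inverse-weight ratio estimator of the population mean.

    Assumes each observation was drawn with probability proportional to w(y).
    """
    inv = [1.0 / w(y) for y in sample]
    return sum(y * v for y, v in zip(sample, inv)) / sum(inv)


def draw_biased(n, w, w_max, rng):
    """Draw n values from a Uniform(0, 1) population under w-biased sampling.

    Rejection sampling: a candidate x is kept with probability w(x) / w_max.
    """
    out = []
    while len(out) < n:
        x = rng.random()
        if rng.random() < w(x) / w_max:
            out.append(x)
    return out


def w(x):
    # Hypothetical weight: larger values are more likely to be observed.
    return 0.2 + 0.8 * x


rng = random.Random(0)  # fixed seed for reproducibility
sample = draw_biased(20000, w, 1.0, rng)

naive = sum(sample) / len(sample)     # ignores the bias, overestimates the mean
corrected = weighted_mean(sample, w)  # should be close to the true mean 0.5

print(f"naive mean:     {naive:.3f}")
print(f"corrected mean: {corrected:.3f}")
```

In this toy setting the naive sample mean targets the mean of the biased distribution (about 0.61) rather than the population mean of 0.5, however large the sample is, while the corrected estimator recovers the population mean. The weight is kept bounded away from zero so that the inverse weights remain stable.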

Full Text