Big data, characterized by exceptionally large sample sizes, often come with outliers and heavy-tailed distributions. An online learning method that is robust to outliers and does not rely on access to historical data is therefore needed to obtain robust and efficient estimators. In this paper, we introduce an online learning approach based on a kernel-based mode objective function, specifically designed to address outliers and heavy-tailed distributions in the context of big data. The approach applies mode regression within an online learning framework that operates on data subsets, so that the estimate built from historical data is continuously updated using pertinent information extracted from each new data subset. By combining the asymptotic distribution functions generated by the local mode estimators from all data subsets, the proposed estimator is updated efficiently by minimizing a weighted least squares-type loss function. To carry out this step, we suggest a modified mode expectation-maximization algorithm for numerical optimization that is both storage friendly and computationally efficient. We demonstrate that the resulting estimator is asymptotically equivalent to the mode estimator calculated using the entire dataset, provided that the covariates in each data subset are homogeneous and the errors are homoskedastic. Monte Carlo simulations and an empirical study are presented to illustrate the finite sample performance of the proposed estimator. Supplemental materials for this paper are available online.
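The updating scheme described above can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not specify the modified mode expectation-maximization algorithm or the exact weights, so the sketch assumes a linear model, a Gaussian kernel with a fixed bandwidth `h`, a textbook MEM iteration, and aggregation weights `X_b^T X_b` (a simplification that is plausible only under the homoskedasticity condition stated above). The names `mem_modal_regression` and `OnlineModeEstimator` are hypothetical.

```python
import numpy as np


def mem_modal_regression(X, y, h, n_iter=200, tol=1e-8):
    """Linear modal regression by a basic MEM iteration (Gaussian kernel, bandwidth h)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares starting value
    for _ in range(n_iter):
        # E-step: kernel weights of the current residuals (shifted in log space for stability)
        log_w = -0.5 * ((y - X @ beta) / h) ** 2
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        # M-step: weighted least squares using the E-step weights
        XtW = X.T * w
        beta_new = np.linalg.solve(XtW @ X, XtW @ y)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta


class OnlineModeEstimator:
    """Keeps only two running sums, so earlier data subsets never need to be stored."""

    def __init__(self, p):
        self.S = np.zeros((p, p))  # running sum of the weight matrices V_b
        self.T = np.zeros(p)       # running sum of V_b @ beta_b

    def update(self, X, y, h):
        beta_b = mem_modal_regression(X, y, h)  # local mode estimator on the new subset
        V_b = X.T @ X                           # assumed weight matrix (see lead-in)
        self.S += V_b
        self.T += V_b @ beta_b
        # Minimizer of sum_b (beta_b - beta)^T V_b (beta_b - beta)
        return np.linalg.solve(self.S, self.T)


# Usage: stream 20 subsets with heavy-tailed (Student t, 2 df) errors.
rng = np.random.default_rng(0)
beta_true = np.array([1.0, 2.0])
online = OnlineModeEstimator(p=2)
for _ in range(20):
    X = np.column_stack([np.ones(500), rng.normal(size=500)])
    y = X @ beta_true + rng.standard_t(df=2, size=500)
    beta_hat = online.update(X, y, h=1.0)
```

Each call to `update` touches only the newest subset and two fixed-size accumulators, which is the storage-friendly property the abstract emphasizes. Bandwidth selection and the paper's actual weighting are left out of this sketch.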