Abstract

Robust covariance estimation commonly proceeds by downweighting outliers. In this article we measure the “outlyingness” of a data point by the standardized distance between the point and its Kth nearest neighbor. The appropriate weights for robust estimation are found by a model-based mixture modeling approach that follows from viewing the data cloud as a realization of a high-dimensional point process. To correct a potential bias when there are no outliers, we introduce a boundary correction procedure that artificially adds extra outlying points; the resulting methodology is called nearest-neighbor variance estimation (NNVE). The strength of NNVE is its robustness against a large proportion of noise points and against deviations of the signal from normality. A consistency result for the method is established, and under reasonable assumptions it is shown that the covariance estimate is bounded and that each point has only bounded influence on the final estimates. NNVE outperformed the popular minimum volume ellipsoid (MVE) estimator in simulation studies, with the largest gains when the proportion of outliers was large: when that proportion was at least 50%, the mean squared error of the NNVE variance estimate was at least 100 times smaller than that of the MVE estimator. NNVE also outperformed MVE when the underlying data distribution was not normal. Good performance of NNVE is demonstrated in several real examples. A potential drawback of NNVE is that data points condensed in moderate-sized clusters are classified as signal. Although we do not advocate discarding moderate-sized clusters as outliers without further checking, this feature of NNVE could be problematic, particularly when only the main data cloud is of interest. We therefore propose a simple diagnostic tool, built on an existing model-based clustering procedure, that checks whether more than one separate data cloud remains after cleaning and supplies the central locations of any separated moderate-sized clusters for further investigation. Finally, because NNVE reduces the problem of finding the robustness weights to a one-dimensional problem, it may be useful in high-dimensional settings such as those encountered in data mining.
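As a concrete illustration of the outlyingness measure described above, the following sketch computes each point's distance to its Kth nearest neighbor and standardizes it. This is an assumption-laden toy version: the brute-force distance computation and the median/MAD standardization are choices made for the sketch, not the paper's model-based mixture weighting.

```python
import numpy as np

def kth_nn_distance(X, k=5):
    """Distance from each row of X to its kth nearest neighbor (brute force)."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distance matrix; row i holds distances from point i.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    d.sort(axis=1)
    # Column 0 is the zero self-distance, so column k is the kth neighbor.
    return d[:, k]

def outlyingness(X, k=5):
    """Kth-NN distances standardized by their median and MAD (sketch only)."""
    dk = kth_nn_distance(X, k)
    med = np.median(dk)
    mad = np.median(np.abs(dk - med)) + 1e-12  # guard against zero MAD
    return (dk - med) / mad
```

Points in the main data cloud have small Kth-NN distances, while isolated noise points have large ones; a robust estimator would then downweight points with large standardized values. Note how a point inside a moderate-sized cluster still gets a small score, which is exactly the behavior the diagnostic tool above is meant to flag.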
