Nonparametric Sequential Clustering of Data Streams with Composite Distributions

Sreeram C Sreenivasan, Srikrishna Bhashyam

doi:10.1016/j.sigpro.2022.108827

Abstract

We study a sequential nonparametric clustering problem to group a finite set of S data streams into K clusters. The data streams are real-valued i.i.d. data sequences generated from unknown continuous distributions. The distributions themselves are organized into clusters according to their proximity to each other based on a certain distance metric. The sequential tests are universal in the sense that they are independent of the underlying configuration of the distribution clusters, and of the distributions themselves, as long as the maximum intra-cluster distance is smaller than the minimum inter-cluster distance. We propose sequential nonparametric clustering tests for two cases: (1) K known and (2) K unknown. In both cases, we show that the proposed sequential nonparametric clustering tests stop in finite time almost surely and are universally exponentially consistent. We also bound the asymptotic growth rate of the expected stopping time as the probability of error goes to zero. Our results generalize earlier work on sequential nonparametric anomaly detection to the more general sequential nonparametric clustering problem. This generalization also provides a new test for the special case of anomaly detection where the anomalous data streams can follow distinct probability distributions. We also devise a modification of the proposed sequential nonparametric clustering tests that can result in significant computational savings with negligible performance degradation. Simulations show that all our proposed sequential clustering tests outperform the corresponding fixed sample size tests in terms of the expected number of samples for a given probability of error. The simulation results also demonstrate the advantage of our proposed clustering tests in anomaly detection problems with distinct anomalies.
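The separation condition stated above (maximum intra-cluster distance smaller than minimum inter-cluster distance) can be illustrated with a minimal Python sketch. This is not the test proposed in the paper: the abstract does not name the distance metric, so the two-sample Kolmogorov-Smirnov distance between empirical CDFs is used here purely as an assumed example, and the helper names `ks_distance` and `separation_gap` are hypothetical.

```python
import numpy as np


def ks_distance(x, y):
    """Two-sample Kolmogorov-Smirnov distance: max |F_x - F_y| over all sample points.

    Assumption: KS is only one possible choice of distance between distributions.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    grid = np.concatenate([x, y])
    # Empirical CDFs of both samples evaluated at every observed point.
    fx = np.searchsorted(x, grid, side="right") / len(x)
    fy = np.searchsorted(y, grid, side="right") / len(y)
    return float(np.max(np.abs(fx - fy)))


def separation_gap(streams, labels):
    """Minimum inter-cluster distance minus maximum intra-cluster distance.

    A positive gap means the candidate clustering satisfies the separation
    condition on the observed samples. Assumes at least two distinct labels.
    """
    intra = [0.0]  # singleton clusters contribute zero intra-cluster distance
    inter = []
    for i in range(len(streams)):
        for j in range(i + 1, len(streams)):
            d = ks_distance(streams[i], streams[j])
            if labels[i] == labels[j]:
                intra.append(d)
            else:
                inter.append(d)
    return min(inter) - max(intra)


# Illustrative data: S = 4 streams, K = 2 clusters (synthetic, for this sketch only).
rng = np.random.default_rng(0)
streams = [rng.normal(0.0, 1.0, 200), rng.normal(0.0, 1.0, 200),
           rng.normal(5.0, 1.0, 200), rng.normal(5.0, 1.0, 200)]
gap_correct = separation_gap(streams, [0, 0, 1, 1])
gap_wrong = separation_gap(streams, [0, 1, 0, 1])
print(gap_correct, gap_wrong)
```

In a sequential setting, one plausible use of such a gap is to recompute it as samples arrive and stop once it exceeds a threshold; the stopping rule actually analyzed in the paper should be taken from the full text.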

Full Text