Abstract

Online data stream mining is of great practical significance because data streams are ubiquitous in real-world scenarios, especially in the big data era. Traditional data mining algorithms cannot be applied directly to data streams because (1) the underlying data distribution may change over time (i.e., concept drift) and (2) labels for streaming data are, in practice, delayed, scarce, or entirely absent. A new research area, named unsupervised concept drift detection, has emerged to tackle this difficulty, mainly based on two-sample hypothesis tests such as the Kolmogorov–Smirnov test. Surprisingly, however, none of the existing methods in this area exploits the Bayesian nonparametric hypothesis test, which offers clear interpretability and a straightforward way to encode prior knowledge, and which imposes no strict or unrealistic requirement to specify the form of the underlying data distribution in advance. In this article, we present a Bayesian nonparametric unsupervised concept drift detection method based on the Polya tree hypothesis test. The basic idea is to decompose the underlying data distribution into a multi-resolution representation that transforms the hypothesis test on the whole distribution into a recursion of simple binomial tests. In addition, an incremental mechanism is designed specifically to improve efficiency in the stream setting. The method detects drifts effectively, and it also locates where a drift occurs and provides the posterior probabilities of the hypotheses. Experiments on synthetic data verify the desired properties of the proposed method, and experiments on real-world data show that it outperforms its frequentist counterpart in the literature for data stream mining.
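
The decomposition into binomial tests can be illustrated with a minimal sketch of a generic Polya tree two-sample test. It is not the incremental method proposed in the article. It assumes that both samples have already been scaled to the unit interval [0, 1), that the tree uses a fixed dyadic partition truncated at a chosen depth, and that each node carries a symmetric Beta prior with parameter c times the squared level, a common choice in the Polya tree literature. The function names and the parameters `depth`, `c`, and `prior_h0` are hypothetical.

```python
import math


def log_beta(a, b):
    """Log of the Beta function, computed with log-gamma for stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def polya_tree_log_bf(x, y, depth=6, c=1.0):
    """Log Bayes factor of H0 (same distribution) against H1 (different).

    Assumptions of this sketch: values lie in [0, 1), the partition is
    dyadic, and the node prior at level m is Beta(c * m**2, c * m**2).
    """
    log_bf = 0.0
    for level in range(1, depth + 1):
        alpha = c * level ** 2
        n_children = 2 ** level
        # Count how many points of each sample fall into each child bin.
        cx = [0] * n_children
        cy = [0] * n_children
        for v in x:
            cx[min(int(v * n_children), n_children - 1)] += 1
        for v in y:
            cy[min(int(v * n_children), n_children - 1)] += 1
        # Each parent node contributes one Beta-binomial comparison:
        # pooled left/right counts (H0) against separate counts (H1).
        for j in range(n_children // 2):
            xl, xr = cx[2 * j], cx[2 * j + 1]
            yl, yr = cy[2 * j], cy[2 * j + 1]
            log_bf += (
                log_beta(alpha + xl + yl, alpha + xr + yr)
                + log_beta(alpha, alpha)
                - log_beta(alpha + xl, alpha + xr)
                - log_beta(alpha + yl, alpha + yr)
            )
    return log_bf


def posterior_h0(log_bf, prior_h0=0.5):
    """Posterior probability of H0 (no drift) from the log Bayes factor."""
    log_odds = log_bf + math.log(prior_h0 / (1.0 - prior_h0))
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)  # avoids overflow for strongly negative odds
    return e / (1.0 + e)


# Usage: a reference window compared with itself and with a shifted window.
grid = [(i + 0.5) / 100 for i in range(100)]
lower_half = [v / 2 for v in grid]
upper_half = [0.5 + v / 2 for v in grid]
same_log_bf = polya_tree_log_bf(grid, grid)
drift_log_bf = polya_tree_log_bf(lower_half, upper_half)
```

In this sketch, a positive log Bayes factor favours the no-drift hypothesis and a negative one favours drift. Because the total is a sum of per-node terms, inspecting which nodes contribute the most negative terms gives a rough indication of where in the distribution the two windows differ.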
