Abstract
In a non-stationary environment, newly received data may exhibit knowledge patterns that differ from those in the data used to train learning models. As time passes, a learning model’s performance may become increasingly unreliable. This problem is known as concept drift and is a common issue in real-world domains. Concept drift detection has attracted increasing attention in recent years. However, very few existing methods pay attention to small regional drifts, and their accuracy may vary due to differing statistical significance tests. This paper presents a novel concept drift detection method based on regional-density estimation, named nearest neighbor-based density variation identification (NN-DVI). It consists of three components. The first is a k-nearest neighbor-based space-partitioning schema (NNPS), which transforms unmeasurable discrete data instances into a set of shared subspaces for density estimation. The second is a distance function that accumulates the density discrepancies in these subspaces and quantifies the overall differences. The third component is a tailored statistical significance test by which the confidence interval of a concept drift can be accurately determined. The distance applied in NN-DVI is sensitive to regional drift and has been proven to follow a normal distribution. As a result, NN-DVI’s accuracy and false-alarm rate are statistically guaranteed. In addition, the method has been evaluated on several benchmarks, including both synthetic and real-world datasets. The overall results show that NN-DVI performs better at addressing concept drift detection problems.
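The partition-and-accumulate idea in the abstract can be sketched as follows. This is a hedged illustration only, not the authors' implementation: the function name `nnps_distance`, the choice `k=5`, and the inverse-sharing weights are all assumptions made for the sketch.

```python
import numpy as np

def nnps_distance(S1, S2, k=5):
    """Illustrative NNPS-style distance: partition the pooled sample into
    k-nearest-neighbor neighborhoods (shared subspaces) and accumulate the
    density discrepancy of the two samples across those subspaces."""
    pooled = np.vstack([S1, S2])
    n1 = len(S1)
    # pairwise Euclidean distances over the pooled sample
    diff = pooled[:, None, :] - pooled[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    # membership matrix: row i marks the k nearest neighbors of instance i
    M = np.zeros(D.shape, dtype=bool)
    idx = np.argsort(D, axis=1)[:, :k]
    np.put_along_axis(M, idx, True, axis=1)
    # weight each instance inversely by how many neighborhoods share it
    w = 1.0 / M.sum(axis=0)
    dens1 = (M[:n1] * w).sum(axis=0)  # per-subspace density of sample 1
    dens2 = (M[n1:] * w).sum(axis=0)  # per-subspace density of sample 2
    # accumulate normalized discrepancies; 0 = identical, 1 = disjoint
    return np.abs(dens1 - dens2).sum() / (dens1 + dens2).sum()
```

Because every subspace is local (a k-NN neighborhood), a drift confined to a small region still shifts the densities of the neighborhoods covering that region, which is what makes this style of distance sensitive to regional drift.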
Highlights
As technology advances, it has become increasingly easy to collect and organize data from different sources
In this research, according to Definition 13, we prove that the d_nnps of two i.i.d. sample sets follows a normal distribution
Evaluating the nearest neighbor-based density variation identification (NN-DVI) on real-world datasets: to demonstrate how our drift detection algorithm improves the performance of learning models in real-world scenarios, we compared our detection method with two closely related algorithms, 1) KL [14] and 2) CM [46], on five benchmark real-world concept drift datasets
Summary
It has become increasingly easy to collect and organize data from different sources. The term concept drift in the machine learning field refers to a phenomenon in knowledge patterns where the data distribution continues to change over time [41]. In real-world scenarios, these types of changes are barely perceptible [46, 45]. For this reason, instead of assuming a stationary environment, an effective learning model must always be alert to concept drift, and must track and adapt to drifts quickly [18, 28, 47]. Category 1) methods actively detect concept drifts at every time step and react after confirming a drift. They can be further divided into three subcategories [46]: a) data distribution-based drift detection, b) learner output-based drift detection, and c) learner parameter-based drift detection.
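A data distribution-based detector of subcategory a) can be sketched as below. This is a hedged illustration, not NN-DVI's tailored test (which relies on the proven normality of its distance): here the normal null is calibrated empirically by permuting the pooled data, and `drift_test`, `alpha`, and `n_perm` are names and defaults invented for the sketch.

```python
import numpy as np
from math import erf, sqrt

def drift_test(S1, S2, dist_fn, alpha=0.01, n_perm=200, seed=0):
    """Sketch of a distribution-based drift check: calibrate a normal null
    for a two-sample distance by permuting the pooled data, then run a
    one-sided z-test on the observed distance."""
    rng = np.random.default_rng(seed)
    observed = dist_fn(S1, S2)
    pooled = np.vstack([S1, S2])
    n1 = len(S1)
    null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(len(pooled))
        null[i] = dist_fn(pooled[perm[:n1]], pooled[perm[n1:]])
    z = (observed - null.mean()) / null.std()
    p = 1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0)))  # one-sided p-value
    return p < alpha, z
```

At each time step the detector compares a reference window with the most recent window; a drift is confirmed only when the observed distance is significantly larger than the null, which keeps the false-alarm rate at roughly the chosen `alpha` level.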