Abstract

Most existing stream clustering algorithms use a two-phase design with an online component and an offline component. The drawback of two-phase algorithms is that they cannot generate the final clusters online; accurate clustering results are obtained only through offline analysis. Furthermore, existing clustering algorithms for uncertain data streams cannot find clusters of arbitrary shape given the variability of uncertain data streams. To address these issues, this paper proposes a novel algorithm, PDG-OCUStream (Probability Density Grid-based Online Clustering for Uncertain Data Streams), in which the summary information of an uncertain data stream is stored in a probability density grid together with the relevant statistical values. Setting a probability density threshold allows clustering quality to be controlled effectively, and the probability density grid structure is easy to maintain and update, which improves the efficiency of online clustering. The algorithm also uses a count-based sliding window, which reflects the current state of the uncertain data stream; adjusting the step of the sliding window saves system resources. In addition, the paper defines a grid probability density similarity, used to initialize and update clusters by merging connected probability density grids, so that the algorithm can distinguish dense regions from sparse regions and quickly find the clusters in the data distribution in real time. Experimental results show that PDG-OCUStream clusters quickly online while maintaining good clustering quality.
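
The general idea of a probability density grid over a count-based sliding window can be illustrated with a short sketch. This is not the PDG-OCUStream implementation: the class name `DensityGridSketch`, its parameters, and the data model are all assumptions made for illustration. Each uncertain point is assumed to carry a single existence probability, a cell's density is assumed to be the sum of those probabilities within the window, and plain adjacency of dense cells stands in for the paper's grid probability density similarity, which the abstract does not define.

```python
from collections import defaultdict, deque


class DensityGridSketch:
    """Illustrative only: names and data model are hypothetical."""

    def __init__(self, cell_size, window_size, density_threshold):
        self.cell_size = cell_size
        self.window_size = window_size            # count-based window length
        self.density_threshold = density_threshold
        self.window = deque()                     # (cell, probability), oldest first
        self.density = defaultdict(float)         # cell -> summed probability
        self.count = defaultdict(int)             # cell -> points in window

    def _cell(self, point):
        # Map coordinates to integer grid indices (floor division).
        return tuple(int(x // self.cell_size) for x in point)

    def insert(self, point, probability):
        cell = self._cell(point)
        self.window.append((cell, probability))
        self.density[cell] += probability
        self.count[cell] += 1
        # Evict the oldest point once the window is over capacity.
        if len(self.window) > self.window_size:
            old_cell, old_p = self.window.popleft()
            self.density[old_cell] -= old_p
            self.count[old_cell] -= 1
            if self.count[old_cell] == 0:
                del self.density[old_cell]
                del self.count[old_cell]

    def clusters(self):
        # Dense cells are those at or above the density threshold.
        dense = {c for c, d in self.density.items()
                 if d >= self.density_threshold}
        seen, result = set(), []
        for start in dense:
            if start in seen:
                continue
            seen.add(start)
            stack, component = [start], set()
            # Merge dense cells that share a face (assumed connectivity).
            while stack:
                cell = stack.pop()
                component.add(cell)
                for axis in range(len(cell)):
                    for delta in (-1, 1):
                        nb = cell[:axis] + (cell[axis] + delta,) + cell[axis + 1:]
                        if nb in dense and nb not in seen:
                            seen.add(nb)
                            stack.append(nb)
            result.append(component)
        return result
```

In this sketch, calling `clusters()` only once every few insertions plays the role of the sliding-window step: a larger step means fewer re-clustering passes and less work per arriving point.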
