Abstract

A data streams is a sequence of dynamic, continuous, unbounded and real time data items with a very high data rate that can only be read once. In data mining, clustering is one of useful techniques for discovering interesting data in the underlying data objects. The problem of clustering can be defined formally as follows: given n data points in the d-dimensional metric space, partition the data points into k clusters such that the data points within a cluster are more similar to each other than data points in different clusters. In the data streams environment, the difficulties of data streams clustering contain storage overhead, low clustering quality and a low updating efficiency. Therefore, in this paper, we present a new clustering algorithm with high quality, GDense, for data streams. The GDense algorithm has high quality due to two kinds of partition: cells and quadcells, and two kinds of threshold: δ and (1/4)δ. From our simulation results, no matter what condition (including the number of data points, the number of cells, the size of the sliding window, and the threshold of dense cell) is, the clustering purity of our GDense algorithm is always higher than that of the CDS-Tree algorithm.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.