Abstract

Many algorithms for clustering data streams that are based on the widely used k-Means have been proposed in the literature. Most of these algorithms assume that the number of clusters, k, is known and fixed a priori by the user. To relax this assumption, which is often unrealistic in practical applications, we propose a support system that not only estimates the number of clusters automatically from the data but also allows the data-stream clustering process to be monitored. We illustrate the potential of the proposed system by means of a prototype that implements eight algorithms for clustering data streams, namely, Stream LSearch-OMRk, Stream LSearch-BkM, Stream LSearch-IOMRk, Stream LSearch-IBkM, CluStream-OMRk, CluStream-BkM, StreamKM++-OMRk, and StreamKM++-BkM. These algorithms combine three state-of-the-art algorithms for clustering data streams with fixed k, namely, Stream LSearch, CluStream, and StreamKM++, with two algorithms for estimating the number of clusters: Ordered Multiple Runs of k-Means (OMRk) and Bisecting k-Means (BkM). We experimentally compare the performance of these algorithms on both synthetic and real-world data streams. Analyses of statistical significance suggest that the algorithms based on OMRk yield the best data partitions, while the algorithms based on BkM are more computationally efficient. Additionally, StreamKM++-OMRk and Stream LSearch-IBkM provide the best trade-off between accuracy and efficiency.
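
To make the idea behind OMRk concrete, the following is a minimal batch sketch in Python, not the prototype described above. It runs k-Means several times for each candidate k in increasing order and keeps the partition with the best validity score. The abstract does not name the validity criterion, so the mean silhouette width used here is an assumption, as are the function names (`kmeans`, `silhouette`, `omrk`), the parameters (`k_max`, `runs`, `seed`), and the random initialization. The stream-specific parts (Stream LSearch, CluStream, StreamKM++) are not modeled.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+

def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's k-Means; returns one cluster label per point."""
    centers = rng.sample(points, k)  # random initial centers (assumption)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette width, O(n^2); higher means better-separated clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return -1.0  # undefined for a single cluster; treat as worst score
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def omrk(points, k_max, runs=10, seed=0):
    """Try k = 2..k_max in order, several runs each; keep the best partition."""
    rng = random.Random(seed)
    best_score, best_labels = -2.0, None
    for k in range(2, k_max + 1):
        for _ in range(runs):
            labels = kmeans(points, k, rng)
            score = silhouette(points, labels)
            if score > best_score:
                best_score, best_labels = score, labels
    # Report the number of non-empty clusters in the winning partition.
    return len(set(best_labels)), best_labels

if __name__ == "__main__":
    demo_rng = random.Random(1)
    blobs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]  # hypothetical toy data
    data = [(cx + demo_rng.gauss(0, 0.5), cy + demo_rng.gauss(0, 0.5))
            for cx, cy in blobs for _ in range(20)]
    k_est, _ = omrk(data, k_max=5)
    print("estimated number of clusters:", k_est)
```

The multiple runs per k matter because a single randomly initialized k-Means run can stop in a poor local optimum; repeating the run makes it more likely that each k is judged by a good partition. The cost of those repeated runs is consistent with the abstract's observation that the OMRk-based algorithms are less computationally efficient than the BkM-based ones.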
