A Support System for Clustering Data Streams with a Variable Number of Clusters

Jonathan de Andrade Silva, Eduardo Raul Hruschka

doi:10.1145/2932704

Abstract

Many algorithms for clustering data streams that are based on the widely used k-Means have been proposed in the literature. Most of these algorithms assume that the number of clusters, k, is known and fixed a priori by the user. To relax this assumption, which is often unrealistic in practical applications, we propose a support system that not only estimates the number of clusters automatically from data but also allows the data-stream clustering process to be monitored. We illustrate the potential of the proposed system by means of a prototype that implements eight algorithms for clustering data streams, namely, Stream LSearch-OMRk, Stream LSearch-BkM, Stream LSearch-IOMRk, Stream LSearch-IBkM, CluStream-OMRk, CluStream-BkM, StreamKM++-OMRk, and StreamKM++-BkM. These algorithms combine three state-of-the-art algorithms for clustering data streams with fixed k, namely, Stream LSearch, CluStream, and StreamKM++, with two algorithms for estimating the number of clusters: Ordered Multiple Runs of k-Means (OMRk) and Bisecting k-Means (BkM). We experimentally compare the performance of these algorithms using both synthetic and real-world data streams. Analyses of statistical significance suggest that the algorithms based on OMRk yield the best data partitions, while the algorithms based on BkM are more computationally efficient. Additionally, StreamKM++-OMRk and Stream LSearch-IBkM provide the best trade-off between accuracy and efficiency.
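
To make the idea of estimating k by repeated runs concrete, the following is a minimal sketch in the spirit of OMRk, not the authors' implementation. It assumes that k-Means is run several times for each k in increasing order over a user-given range and that the best partition is chosen by a simplified (centroid-based) silhouette; the abstract does not state which validity criterion is used, so that choice is an assumption. It works on a static batch of points rather than a stream, and all function and parameter names (`assign`, `kmeans`, `simplified_silhouette`, `omrk`, `k_min`, `k_max`, `n_runs`) are hypothetical.

```python
import math
import random


def assign(points, centroids):
    """Return the index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]


def kmeans(points, k, rng, max_iter=100):
    """Plain Lloyd-style k-Means with random initial centroids."""
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(tuple(sum(x) / len(members) for x in zip(*members))
                           if members else centroids[c])
        if updated == centroids:  # converged: centroids no longer move
            break
        centroids = updated
    return centroids, assign(points, centroids)


def simplified_silhouette(points, centroids, labels):
    """Average of (b - a) / max(a, b), using distances to centroids only.

    a: distance to the point's own centroid; b: distance to the nearest
    other centroid. Requires at least two centroids.
    """
    total = 0.0
    for p, lab in zip(points, labels):
        a = math.dist(p, centroids[lab])
        b = min(math.dist(p, c) for i, c in enumerate(centroids) if i != lab)
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / len(points)


def omrk(points, k_min=2, k_max=6, n_runs=10, seed=0):
    """Try k = k_min..k_max in order, n_runs each; keep the best-scoring run.

    Returns (best_k, labels, score).
    """
    rng = random.Random(seed)
    best = (None, None, -math.inf)
    for k in range(k_min, k_max + 1):
        for _ in range(n_runs):
            centroids, labels = kmeans(points, k, rng)
            score = simplified_silhouette(points, centroids, labels)
            if score > best[2]:
                best = (k, labels, score)
    return best
```

Calling `omrk(points, k_min=2, k_max=5)` on a list of coordinate tuples returns the selected k together with the cluster labels and the score of the winning partition. The nested loop is why an approach of this kind would be expected to cost more than a single pass: every candidate k is paid for `n_runs` times.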
