Abstract

Stream clustering methods, which group continuous, temporally ordered dynamic data instances, have been used in a number of applications such as stock market analysis, network analysis, and cosmological analysis. Most of the popular stream clustering algorithms are linear in nature, i.e. they assume that the data is linearly separable in the input space and use measures such as the Euclidean distance to define the inter-point similarity. Though these linear clustering algorithms are efficient, they do no achieve acceptable cluster quality on real-world data. Kernel-based clustering algorithms, which use non-linear similarity measures, yield better cluster quality, but are unsuitable for clustering data streams due to their high running time and memory complexity. We propose an efficient kernel-based clustering algorithm, called the Approximate Stream Kernel k-means, which uses importance sampling to sample a subset of the data stream, and clusters the entire stream based on each data point's similarity to the sampled data points in real-time. Every time a data point is sampled, the kernel matrix representing the similarity between the sampled points is updated, and projected to a low dimensional space spanned by its top eigenvectors. The data points are clustered in this low-dimensional space using the k-means algorithm. Thus, the Approximate Stream Kernel k-means algorithm performs clustering in linear time using kernel-based similarity. We show that only a small number of points need to be sampled from the data stream, and the resulting approximation error is well-bounded. Using several large benchmark data sets to simulate data streams, we demonstrate that the proposed algorithm achieves a significant speedup over other kernel-based clustering algorithms with minimal loss in cluster quality. We also demonstrate the practical applicability of the proposed algorithm by using it to efficiently find trending topics in the Twitter stream data.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call