Abstract

Stream clustering is one of the most important fields in machine learning. Traditional unsupervised clustering tasks have normally been carried out in batch mode, where the data fit in memory and several passes over the data are therefore allowed. However, the Big Data paradigm has created an environment where data can be potentially unbounded and arrive continuously. Such data streams can reach computing systems at high speed and may be produced by non-stationary data generation processes. For clustering tasks, this implies that it is impossible to store all the data in memory and that the number and size of the clusters are unknown. Noise levels can also be high because of either data generation or transmission. All these factors make traditional clustering methods unsuitable. As a consequence, stream clustering has emerged as a field of intense research aimed at tackling these challenges. Clustream is one of the most advanced state-of-the-art stream clustering algorithms. It normally requires two phases: a first online micro-clustering phase, where statistics describing the incoming data are gathered; and a second offline macro-clustering phase, where a conventional non-stream clustering algorithm is executed on the high-level statistics resulting from the online step. Because of this design, it may require expert-level parametrization, suffer from low runtime performance, be highly sensitive to noise, or degrade considerably in high-dimensional spaces because of its offline step. We propose a new stream clustering algorithm, Clustream-hybrid, based on Clustream clustering principles. It follows the same process as Clustream but uses k-means++ instead of k-means in the macro-clustering phase, enabling fast runtime computation while maintaining accuracy in high-dimensional settings. We integrate it into the MOA (Massive Online Analysis) tool.
We evaluated the results with nine clustering quality metrics and compared the performance with Clustream on both synthetic and real data sets. The results are encouraging: Clustream-hybrid outperforms Clustream on the quality metrics in most cases.
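
The two-phase idea described above can be illustrated with a minimal Python sketch. It is not the MOA implementation and not necessarily the exact procedure used in Clustream-hybrid. The class `MicroCluster`, the functions `online_update` and `kmeanspp_seeds`, and the fixed `radius` threshold are hypothetical names and simplifications introduced here for illustration only. The online step keeps additive statistics (count, linear sum, squared sum) per micro-cluster, and the offline step applies weighted k-means++ seeding to the micro-cluster centers.

```python
import random


class MicroCluster:
    """Additive summary of the points absorbed so far (illustrative only)."""

    def __init__(self, dim):
        self.n = 0                 # number of points absorbed
        self.ls = [0.0] * dim      # per-dimension linear sum
        self.ss = [0.0] * dim      # per-dimension squared sum

    def insert(self, x):
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v

    def center(self):
        return [s / self.n for s in self.ls]


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def online_update(micro_clusters, x, radius):
    """Online phase (simplified): absorb x into the nearest micro-cluster
    if it lies within a fixed radius, otherwise open a new micro-cluster.
    The fixed radius is an assumption made for this sketch."""
    if micro_clusters:
        nearest = min(micro_clusters, key=lambda m: sq_dist(m.center(), x))
        if sq_dist(nearest.center(), x) <= radius ** 2:
            nearest.insert(x)
            return
    mc = MicroCluster(len(x))
    mc.insert(x)
    micro_clusters.append(mc)


def kmeanspp_seeds(centers, weights, k, rng):
    """Offline phase seeding: weighted k-means++. Each new seed is drawn with
    probability proportional to weight * squared distance to the nearest
    seed chosen so far."""
    first = rng.choices(range(len(centers)), weights=weights, k=1)[0]
    seeds = [centers[first]]
    while len(seeds) < k:
        scores = [w * min(sq_dist(c, s) for s in seeds)
                  for c, w in zip(centers, weights)]
        if sum(scores) == 0:       # fewer distinct centers than k
            break
        idx = rng.choices(range(len(centers)), weights=scores, k=1)[0]
        seeds.append(centers[idx])
    return seeds


# Usage: stream six points from two well-separated groups.
stream = [(0, 0), (10, 10), (0.5, 0), (10, 10.5), (0, 0.5), (10.5, 10)]
micro_clusters = []
for point in stream:
    online_update(micro_clusters, point, radius=2.0)

centers = [m.center() for m in micro_clusters]
weights = [m.n for m in micro_clusters]
seeds = kmeanspp_seeds(centers, weights, k=2, rng=random.Random(0))
```

In a full macro-clustering step, the seeds would then initialize ordinary k-means iterations over the weighted micro-cluster centers. The sketch stops at seeding because that is the part that distinguishes k-means++ from plain k-means.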
