Abstract

Clustering is an efficient data mining and machine learning method for gaining insight into which objects of a dataset can be grouped together. The K-Means algorithm and the Hierarchical Agglomerative Clustering (HAC) algorithm are two of the best-known and most commonly used clustering methods; the former for its low time cost and the latter for its accuracy. However, even K-Means can incur unpredictable time costs when used for document clustering over large-scale collections. In this paper, aiming at the efficient handling of big text data, we present a hybrid clustering approach based on a customized version of the Buckshot algorithm, which first applies a hierarchical clustering procedure to a sample of the input dataset and then uses the results as the initial centers for a K-Means-based assignment of the remaining documents, requiring very few iterations. We also give a highly efficient adaptation of the proposed Buckshot-based approach to the MapReduce model, which we test experimentally using Apache Hadoop on a real cluster environment. The experiments show that the approach achieves acceptable clustering quality together with significant execution time improvements. Preliminary results from related experiments using the Spark framework are also presented.
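The two-phase idea described above can be illustrated with a minimal single-machine sketch. It is not the customized or MapReduce version presented in the paper. It assumes dense numeric vectors, squared Euclidean distance, centroid linkage for the hierarchical phase, and the classic Buckshot sample size of about sqrt(k·n); the function names (`hac_centers`, `buckshot`) and the `iterations` and `seed` parameters are hypothetical and chosen only for illustration.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def hac_centers(sample, k):
    """Naive agglomerative clustering (centroid linkage, an assumption)
    that merges the two closest clusters until k remain, then returns
    the k cluster centroids."""
    clusters = [[p] for p in sample]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sq_dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return [centroid(c) for c in clusters]


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [
        min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
        for p in points
    ]


def buckshot(points, k, iterations=2, seed=0):
    """Phase 1: HAC on a random sample yields k initial centers.
    Phase 2: a few K-Means style update/assignment passes over all points."""
    rng = random.Random(seed)
    n = len(points)
    # Classic Buckshot sample size, sqrt(k * n); assumed here, not taken
    # from the paper's customized variant.
    sample_size = min(n, max(k, math.ceil(math.sqrt(k * n))))
    sample = rng.sample(points, sample_size)

    centers = hac_centers(sample, k)
    labels = assign(points, centers)
    for _ in range(iterations):
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = centroid(members)
        labels = assign(points, centers)
    return labels, centers


if __name__ == "__main__":
    docs = [[0, 0], [0, 1], [1, 0], [1, 1],
            [10, 10], [10, 11], [11, 10], [11, 11]]
    labels, centers = buckshot(docs, k=2)
    print(labels, centers)
```

The costly quadratic hierarchical step runs only on the sample, which is why the sample size matters: with about sqrt(k·n) sampled points, the number of pairwise comparisons in that step grows roughly linearly with n rather than quadratically.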
