Abstract

In information retrieval and text mining, document clustering is a major challenge because document collections grow day by day. Clustering is an NP-hard problem, so meta-heuristic algorithms can be an effective way to solve it. When the solution space is large, traditional methods cannot find a solution in a reasonable amount of time. K-means is a heuristic clustering algorithm, and two main issues with heuristic algorithms are premature convergence and trapping in local optima. Moreover, finding the right number of clusters is one of the main drawbacks of the k-means algorithm: the correct value of k is often unclear, and different researchers have used different methods to determine it. To overcome these problems, this study presents a novel hybrid approach for document clustering. One challenge with the existing Black Hole (BH) algorithm is the input data type: until recently, the algorithm accepted only textual data. Another flaw in the existing model is that it does not automatically choose the number of clusters k, and its centroids are chosen at random. In this paper, we construct a hybrid cluster identification approach that combines the Elbow method and the Silhouette score to identify k. The paper offers three novel model combinations for representing text documents: i) K-means++ - BH + TF-IDF with fixed k, ii) K-means++ - BH + W2V with fixed k, and iii) Hybrid Black Hole with automated k. The proposed improvements are validated on the document clustering problem. Findings are reported using two evaluation measures, one external (Purity) and one internal (Silhouette score). Experiments were carried out on four alphanumeric datasets (Doc50, Reuters, WebKB and News20) and on two numeric datasets (Iris and Wine).
The results are analyzed in detail for each research contribution, comparing the performance of the proposed algorithm with existing clustering methods. The results show that the proposed Hybrid BH algorithm outperforms the existing clustering methods on all datasets. Clustering of data with and without stop words is examined, and the two alternative word embeddings used for data exploration in conjunction with the proposed model are also evaluated. In the present study, the proposed Hybrid BH algorithm handles the optimal value of k efficiently, which is one of the major contributions of the paper. We conclude that Hybrid Black Hole is an effective algorithm for cluster analysis.
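
For readers unfamiliar with the external measure named above, the following is a minimal sketch of how purity is commonly computed: each cluster is credited with its most frequent true class, and the credited counts are divided by the total number of documents. The function name `purity` and the toy labels are illustrative assumptions; they are not taken from the paper and do not reproduce its experimental setup.

```python
from collections import Counter, defaultdict


def purity(true_labels, cluster_ids):
    """Return the purity of a clustering, a value in (0, 1].

    true_labels: ground-truth class of each document (hypothetical input).
    cluster_ids: cluster assigned to each document, in the same order.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have equal length")
    if not true_labels:
        raise ValueError("at least one document is required")

    # Count how often each true class occurs inside each cluster.
    class_counts = defaultdict(Counter)
    for label, cluster in zip(true_labels, cluster_ids):
        class_counts[cluster][label] += 1

    # Credit each cluster with the size of its majority class.
    majority_total = sum(max(counts.values()) for counts in class_counts.values())
    return majority_total / len(true_labels)


if __name__ == "__main__":
    # Toy example: six documents, three true classes, three clusters.
    truth = ["a", "a", "a", "b", "b", "c"]
    clusters = [0, 0, 1, 1, 1, 2]
    print(f"purity = {purity(truth, clusters):.3f}")
```

In the toy example, five of the six documents fall in the majority class of their cluster, so the purity is 5/6. Purity rises trivially as the number of clusters grows, which is one reason to report it alongside an internal measure such as the Silhouette score.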

