Abstract

The daily growth of superfluous information has made clustering documents into meaningful sets challenging. We propose an efficient approach for obtaining semantic clusters from a large volume of documents. Preprocessing based on lexical ontological information from WordNet reduces the feature space and eliminates synonymy problems among the features. A considerable decrease in computational time is achieved by means of an enhanced k-means clustering algorithm, which computes the starting centroids using a sorting technique based on a Red-Black Tree, ensuring both efficiency and meaningful clusters. Memoization techniques are used in the subsequent stages to avoid redundant computations. Results indicate that our method produces more meaningful clusters than approaches that employ word embedding models such as Word2Vec, FastText, and BERT for feature extraction. Experiments on the MiniNewsGroup, 20NewsGroup Large, and Reuters-21578 datasets show strong clustering outcomes in terms of purity and execution time. On the large 20NewsGroup Large dataset, our method achieves a better NMI (Normalized Mutual Information) score than existing methods.
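The sorting-based centroid seeding mentioned above can be illustrated with a minimal sketch. The abstract does not specify the sort key or partitioning scheme, so the choices here (ordering documents by feature-vector norm, splitting the ordered list into k equal partitions, and averaging each partition) are assumptions; a Red-Black Tree would maintain the same sorted order incrementally with O(log n) insertions.

```python
import math

def norm(v):
    """Euclidean norm of a feature vector."""
    return math.sqrt(sum(x * x for x in v))

def seed_centroids(points, k):
    """Sorting-based k-means seeding (illustrative sketch):
    order points by vector norm, split the ordered list into
    k partitions, and use each partition's mean as an initial
    centroid. The norm key and equal partitions are assumptions,
    not the paper's exact method."""
    ordered = sorted(points, key=norm)
    size = len(ordered) // k
    centroids = []
    for i in range(k):
        # last partition absorbs any leftover points
        part = ordered[i * size:] if i == k - 1 else ordered[i * size:(i + 1) * size]
        dim = len(part[0])
        centroids.append([sum(p[d] for p in part) / len(part) for d in range(dim)])
    return centroids
```

Because the seeds are spread across the sorted range rather than drawn at random, repeated runs start from the same well-separated centroids, which is one way such seeding can both speed up convergence and stabilize cluster quality.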
