Document clustering is the process of automatically grouping documents into clusters such that the documents in one cluster are highly similar to one another and dissimilar to the documents in the other clusters. Owing to its broad application in fields such as search engines, web mining, and information retrieval, it has been the subject of much research. It involves grouping documents that are similar to one another and measuring how similar they are. It also facilitates navigation by offering effective document representation and visualization. Hence, this paper performs document clustering using a nature-inspired optimization technique. First, the dataset is manually gathered from different sources. Next, data preparation extracts the text content from the published documents. The prepared data are pre-processed by removing punctuation and stop words and converting the text to lowercase. Features are then extracted from the pre-processed data using the Term Frequency-Inverse Document Frequency (TF-IDF) approach to identify keywords. The extracted features pass to the final clustering phase, which employs the spectral clustering algorithm, with its parameters tuned by the nature-inspired Particle Swarm Optimization (PSO) algorithm using maximization of the silhouette score as the objective function. The proposed spectral clustering-PSO model groups the documents into six classes: data mining, deep learning, image, machine learning, network, and sports. The proposed model outperforms existing techniques on several measures. In terms of silhouette score, the proposed spectral clustering-PSO is 51.92%, 70.81%, 45.93%, and 20.89% better than JA-GWO, tpLDA, HDMA, and Net2Vec, respectively. Similarly, in terms of Davies-Bouldin score, it is 89.69%, 58.48%, 32.67%, and 13.99% better than JA-GWO, tpLDA, HDMA, and Net2Vec, respectively.
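
To illustrate the pre-processing and TF-IDF steps described above, a minimal sketch in plain Python follows. It is not the authors' implementation: the stop-word list, the function names (`preprocess`, `tf_idf`, `top_keywords`), the sample documents, and the particular TF-IDF variant (relative term frequency multiplied by ln(N/df)) are all assumptions chosen for brevity.

```python
import math
import string
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the abstract does not specify one).
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to", "is", "for", "on"}


def preprocess(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in no_punct.split() if tok not in STOP_WORDS]


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    Library implementations often add smoothing and normalization.
    """
    tokenized = [preprocess(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def top_keywords(weight_map, k=3):
    """Return the k highest-weighted terms, breaking ties alphabetically."""
    ranked = sorted(weight_map.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical sample documents, for illustration only.
docs = [
    "Deep learning for image classification.",
    "Mining patterns in network data!",
    "Deep learning on network data.",
]
vectors = tf_idf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, "->", top_keywords(vec, k=2))
```

In a pipeline like the one summarized here, such weight vectors would be the input to the spectral clustering stage. The PSO-based parameter tuning is not shown in this sketch.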