Abstract

In the present digital era, millions of Internet users generate vast amounts of data in the form of unstructured text documents. Clustering and organizing these documents plays a crucial role in data analysis and market research applications. This paper proposes a modified metaheuristic-based optimization technique, combined with k-means, for clustering text documents. In the initial phase, the input data are acquired from three benchmark datasets: Reuters-21578, 20-Newsgroup, and British Broadcasting Corporation (BBC)-sport. The data are then denoised using common preprocessing techniques: tokenization, stop word removal, stemming, and lemmatization. Next, the denoised data are transformed into feature vectors using the Term Frequency (TF)-Inverse Document Frequency (IDF) technique. The computed feature vectors are passed to the Modified Particle Swarm Optimization (MPSO) with k-means model, which groups closely related text documents while minimizing the similarity between different clusters. The experimental evaluation showed that the proposed MPSO with k-means model achieved accuracies of 0.85, 0.85, and 0.86 on the Reuters-21578, 20-Newsgroup, and BBC-sport datasets, respectively, outperforming the comparative models.
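The pipeline described above (preprocessing, TF-IDF feature vectors, k-means grouping) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the MPSO step is not reproduced, and plain k-means with hand-picked initial centroids stands in for it. Stemming and lemmatization are omitted, and the tiny stop word list, the smoothed IDF formula, the example documents, and all function names (`tokenize`, `tfidf_vectors`, `kmeans`) are assumptions made only for illustration.

```python
import math
from collections import Counter

# Tiny illustrative stop word list (an assumption, not the paper's list).
STOP_WORDS = {"the", "a", "in", "of", "and", "to", "is"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return (vocabulary, list of L2-normalised TF-IDF vectors)."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(w for toks in tokenized for w in set(toks))
    # Smoothed IDF (one common variant; the paper does not specify one).
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1.0 for w in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [(tf[w] / len(toks)) * idf[w] if toks else 0.0 for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors

def kmeans(vectors, k, init_indices, max_iter=50):
    """Plain k-means; initial centroids are the vectors at init_indices.

    In the paper, MPSO guides the clustering; fixed indices are used here
    only to keep the sketch short and deterministic.
    """
    centroids = [list(vectors[i]) for i in init_indices]
    labels = [-1] * len(vectors)
    for _ in range(max_iter):
        # Assign each vector to the nearest centroid (squared Euclidean).
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centroids[c])))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

docs = [
    "The striker scored a goal in the football match",
    "A late goal decided the football match",
    "Oil prices rose as the market opened",
    "The market fell after oil prices dropped",
]
vocab, vectors = tfidf_vectors(docs)
labels, centroids = kmeans(vectors, k=2, init_indices=[0, 2])
print(labels)  # [0, 0, 1, 1]
```

Because the vectors are L2-normalised, squared Euclidean distance orders neighbours the same way as cosine similarity, so the two sports documents and the two market documents land in separate clusters. A metaheuristic such as PSO would typically replace the fixed `init_indices` with a search over candidate centroid positions.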
