Abstract

Text clustering is a subfield of machine learning (ML) and natural language processing (NLP) that groups similar sentences or documents based on their content. However, insignificant features in the documents reduce the accuracy of information retrieval, which makes it difficult for a clustering approach to group similar documents efficiently. In this research, the mayfly optimization algorithm (MOA) combined with k-means is proposed for text document clustering (TDC). First, data is obtained from the Reuters-21578, 20-Newsgroup, and BBC sports datasets and pre-processed by stemming and stop word removal to discard unwanted words and phrases. Class imbalance is then addressed with the adaptive synthetic sampling algorithm (ADASYN), and features are extracted using term frequency-inverse document frequency (TF-IDF) and WordNet. Finally, MOA with k-means is used for TDC. The proposed approach achieves accuracies of 99.75%, 99.54%, and 98.24%, outperforming existing techniques such as the fuzzy rough set-based robust nearest neighbor-convolutional neural network (FRS-RNN-CNN), TopicStriker, the Modsup-based frequent itemset and rider optimization-based moth search algorithm (Modsup-Rn-MSA), the hierarchical Dirichlet-multinomial mixture, and multi-view clustering via consistent and specific non-negative matrix (MCCS).
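
The feature extraction and clustering stages described above can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only stop word removal, TF-IDF weighting, and plain k-means (Lloyd's algorithm), and it omits stemming, ADASYN, WordNet features, and the mayfly optimization itself. The stop word list, the smoothed IDF formula, the toy documents, and all function names are assumptions made for illustration. The initial centroids are passed in explicitly, since that is one place where an optimizer such as MOA could plausibly supply candidates instead of a fixed choice.

```python
import math
from collections import Counter

# Toy stop word list (an assumption; real pipelines use a much larger list).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for", "on"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [t for t in tokens if t and t not in STOP_WORDS]

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors), one vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed IDF (one common variant, chosen here as an assumption).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, init_centroids, max_iter=100):
    """Lloyd's algorithm from the given initial centroids; returns (labels, centroids)."""
    centroids = [list(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [v for v, l in zip(vectors, labels) if l == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical toy corpus: two sports documents and two finance documents.
docs = [
    "The striker scored a goal in the football match",
    "Football match ended after a late goal",
    "Stock market shares fell on trading news",
    "Shares and stock prices rose in the market",
]
vocab, X = tfidf_vectors(docs)
labels, centroids = kmeans(X, init_centroids=[X[0], X[2]])
print(labels)
```

On this toy corpus the two sports documents share no terms with the two finance documents, so the clusters separate cleanly. Real corpora are far less tidy, and plain k-means is sensitive to the starting centroids.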
