Abstract

The rapid growth of the Internet produces massive amounts of text data. Because this data is both large in volume and unstructured in format, extracting and analyzing the relevant information is very challenging. Text document clustering is a text-mining process that partitions a set of text documents into mutually exclusive clusters, such that documents within the same cluster are similar to each other in content, while documents from different clusters differ. One of the biggest challenges in text clustering is partitioning a collection of text data by measuring the relevance of the content in the documents. To address this issue, this work proposes a hybrid swarm intelligence algorithm combined with the K-means algorithm for text clustering. First, the hybrid fruit-fly optimization algorithm is tested on ten unconstrained CEC2019 benchmark functions. Next, the proposed method is evaluated on six standard benchmark text datasets. The experimental evaluation, on both the unconstrained functions and the text documents, indicates that the proposed approach is robust and outperforms the state-of-the-art methods used for comparison.

Highlights

  • Text document clustering has become an important and fast-growing research area, due to the massive amounts of text data produced by the Internet, social media, email and text messages, and other sources

  • The proposed method is first validated on unconstrained benchmark functions; it is then applied to text document clustering (TDC)

  • The performance of the proposed method is validated on 10 modern CEC2019 functions [66], and the results are compared with the original fruit-fly optimization (FFO) and nine other metaheuristic-based approaches (EHOI, EHO, SCA, SSA, grasshopper optimization algorithm (GOA), WOA, BBO, MFO, and particle swarm optimization (PSO)) [67], with all simulations conducted under similar conditions and on the same problem sets


Summary

Introduction

Text document clustering has become an important and fast-growing research area, due to the massive amounts of text data produced by the Internet, social media, email and text messages, and other sources. One crucial method in text mining is clustering, which aims to automatically partition a collection of documents into a finite set of homogeneous clusters (groups). Documents within the same cluster are similar to each other in content, while documents in different clusters are less similar. From the perspective of optimization, clustering can be formulated as an NP-hard optimization problem. Metaheuristic algorithms have been shown to solve NP-hard optimization problems efficiently, yielding near-optimal solutions in a reasonable amount of time. Nature-inspired metaheuristic algorithms can be divided into two major categories: swarm intelligence and evolutionary algorithms. In this work, a hybrid swarm intelligence algorithm is proposed. An opposition-based learning mechanism is incorporated into the hybrid method, which is combined with the traditional K-means algorithm [4] and employed for text-based document clustering.
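To make the combination of opposition-based learning and K-means more concrete, the following is a minimal Python sketch. It is not the authors' hybrid fruit-fly algorithm. It only illustrates the commonly used opposite-point rule (lower + upper - x per dimension) applied to candidate centroids, with the K-means sum of squared errors as the fitness. It assumes that documents have already been converted to numeric vectors, and all function names (opposite, sse, kmeans_step, obl_init) are hypothetical.

```python
import random


def opposite(point, lower, upper):
    """Opposite of `point` inside the box [lower, upper], per dimension."""
    return [lo + hi - x for x, lo, hi in zip(point, lower, upper)]


def sse(centroids, data):
    """K-means objective: squared distance of each vector to its nearest centroid."""
    total = 0.0
    for vec in data:
        total += min(sum((a - b) ** 2 for a, b in zip(vec, c)) for c in centroids)
    return total


def kmeans_step(centroids, data):
    """One K-means iteration: assign vectors to the nearest centroid, then average."""
    groups = [[] for _ in centroids]
    for vec in data:
        dists = [sum((a - b) ** 2 for a, b in zip(vec, c)) for c in centroids]
        groups[dists.index(min(dists))].append(vec)
    updated = []
    for centroid, group in zip(centroids, groups):
        if group:
            updated.append([sum(col) / len(group) for col in zip(*group)])
        else:
            updated.append(list(centroid))  # an empty cluster keeps its centroid
    return updated


def obl_init(data, k, rng):
    """Draw k random centroids, build their opposites, keep the set with lower SSE."""
    dims = len(data[0])
    lower = [min(v[d] for v in data) for d in range(dims)]
    upper = [max(v[d] for v in data) for d in range(dims)]
    candidate = [[rng.uniform(lower[d], upper[d]) for d in range(dims)]
                 for _ in range(k)]
    mirrored = [opposite(c, lower, upper) for c in candidate]
    return min((candidate, mirrored), key=lambda cs: sse(cs, data))


# Toy usage: four 2-dimensional "document vectors" forming two obvious groups.
docs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
centroids = obl_init(docs, 2, random.Random(0))
for _ in range(5):
    centroids = kmeans_step(centroids, docs)
```

In this sketch the opposition step is used only once, to choose a starting point for K-means. In a swarm-based method it would more plausibly be applied to members of the population during the search, but that detail is not specified in the text above.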
