Abstract

The automatic topic extraction (TE) from scientific publications provides a very compact summary of the clusters’ contents. This often helps in locating information easily. TE enables us to define the boundaries of the scientific fields. Text Document Clustering (TDC) represents, in general, the first step of topic identification to identify the documents, which address a related subject matter. Metaheuristics are typically used as efficient approaches for TDC. The multi-verse optimizer algorithm (MVO) involves a stochastic population-based algorithm. It has been recently proposed and successfully utilized to tackle many hard optimization problems. In the TE process, the focus of each statistical TE method is placed on various language feature space aspects. The aim of this paper is to design a novel ensemble method for an automatic TE from a collection of scientific publications based on MVO as the clustering algorithm. The automatic TE, which is used in our approach, is term frequency-inverse document frequency (TF-IDF), most frequent based keyword extraction (TF), co-occurrence statistical information-based keyword extraction (CSI), TextRank (TR), and mutual information (MI). A group of candidate topics can be provided by each automatic TE method for the proposed ensemble method. Next, the ensemble approach prunes the candidate topics’ set via the application of a specific filtering heuristic. Then, their scores are recalculated based on the prescribed metrics. After that, for selecting a set of topics for certain scientific publications, dynamic threshold functions are applied. The findings emphasized the refined candidate set’s efficiency, as well as effectiveness. The results also showed that the system’s quality has been improved by new topics. The proposed method achieved better precision, as well as recall on a similar dataset compared to the state-of-the-art TE methods.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call