Abstract

Topic extraction, which seeks to identify significant topics in text collections, is an essential task in bibliometric data analysis, data mining and knowledge discovery. Conventional topic extraction schemes require human intervention and comprehensive pre-processing to represent text collections appropriately. In this paper, we present a two-stage framework for topic extraction from scientific literature that combines word embedding schemes with cluster analysis. In the first stage, we propose an improved word embedding scheme that incorporates word vectors obtained by the word2vec, POS2vec, word-position2vec and LDA2vec schemes. In the clustering stage, we present an improved clustering ensemble framework that combines conventional clustering methods (k-means, k-modes, k-means++, self-organizing maps and the DIANA algorithm) by means of iterative voting consensus. In the empirical analysis, we examine a corpus of 160,424 abstracts of articles from various disciplines, including agricultural engineering, economics, engineering and computer science. We compare the performance of the proposed scheme with conventional baseline clustering methods (such as k-means, k-modes and k-means++), LDA-based topic modelling and conventional word embedding schemes. The results indicate that the ensemble word embedding scheme yields better predictive performance than the baseline word vectors for topic extraction, and that the ensemble clustering framework outperforms the baseline clustering methods. The proposed framework improves the Jaccard coefficient, the Fowlkes-Mallows measure and the F1 score.
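The evaluation measures named above can be illustrated with a minimal sketch. It assumes the common pair-counting formulation, in which every pair of documents is classified by whether the reference labeling and the clustering both place the two documents together. The function names `pair_counts` and `pair_scores` are hypothetical, and the paper may compute F1 differently (for example, per matched cluster), so this is an illustration rather than the authors' evaluation code.

```python
from itertools import combinations
from math import sqrt


def pair_counts(true_labels, pred_labels):
    """Count document pairs by agreement between reference and predicted labels."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1  # together in both labelings
        elif same_pred:
            fp += 1  # clustered together, but different reference topics
        elif same_true:
            fn += 1  # same reference topic, but split by the clustering
    return tp, fp, fn


def pair_scores(true_labels, pred_labels):
    """Return pair-counting Jaccard, Fowlkes-Mallows and F1 scores."""
    tp, fp, fn = pair_counts(true_labels, pred_labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    jaccard = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    fowlkes_mallows = sqrt(precision * recall)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"jaccard": jaccard, "fowlkes_mallows": fowlkes_mallows, "f1": f1}


if __name__ == "__main__":
    # Toy example: four documents, two reference topics, three predicted clusters.
    print(pair_scores([0, 0, 1, 1], [0, 0, 1, 2]))
```

All three scores lie in [0, 1] and reach 1 only when the clustering groups exactly the same pairs of documents as the reference labeling.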

Highlights

  • Topic extraction is an essential task in bibliometric data analysis, data mining and information retrieval, which aims to identify significant topics from text collections

  • The presented scheme employs a two-stage procedure in which word embedding schemes are used in conjunction with cluster analysis

  • Motivated by the predictive performance of cluster ensembles in cluster analysis and by the performance gains that improved word embedding schemes have achieved on several natural language processing tasks, this paper presents a hybrid topic extraction framework based on improved word embeddings and cluster ensembles


Summary

Introduction

Topic extraction is an essential task in bibliometric data analysis, data mining and information retrieval, which aims to identify significant topics from text collections. Topic extraction from scientific literature is especially useful in exploratory data analysis, where it provides a quick overview of a collection's contents and helps locate information objects [1]. Conventional topic extraction schemes require comprehensive pre-processing to represent text collections appropriately. Before text mining methods can be applied, conventional natural language processing steps must be carried out, such as identifying synonymous terms, identifying compound terms, transforming terms by stemming and lemmatization, and removing stop words and common terms [4]. Conventional schemes also involve human intervention: the required tasks include retrieving links among citations and co-citations and synthesizing technical synonyms [5].
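The conventional pre-processing steps listed above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the pipeline used in the paper: the stop-word list, the synonym table, the compound-term table and the suffix-stripping rule `crude_stem` are all hypothetical placeholders, and a real system would use a proper stemmer or lemmatizer and curated vocabularies.

```python
import re

# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for"}
SYNONYMS = {"svm": "support_vector_machine"}
COMPOUNDS = {("topic", "model"): "topic_model"}


def crude_stem(token):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def merge_compounds(tokens):
    """Replace adjacent token pairs found in COMPOUNDS with a single term."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in COMPOUNDS:
            merged.append(COMPOUNDS[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def preprocess(text):
    """Tokenize, drop stop words, stem, map synonyms and merge compound terms."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    tokens = [crude_stem(t) for t in tokens]
    tokens = [SYNONYMS.get(t, t) for t in tokens]
    return merge_compounds(tokens)


if __name__ == "__main__":
    print(preprocess("Clustering the topic models of SVMs"))
```

Each table in this sketch has to be built and maintained by hand, which illustrates the kind of manual effort that conventional schemes depend on.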
