Embedding-based Detection and Extraction of Research Topics from Academic Documents Using Deep Clustering

Alireza Abbasi, Hussein A. Abbass, Sahand Vahidnia

doi:10.2478/jdis-2021-0024


Open Access



Abstract

Purpose: Detecting research fields or topics and understanding their dynamics helps the scientific community decide on the establishment of scientific fields. It also supports better collaboration with governments and businesses. This study investigates the development of research fields over time by framing it as a topic detection problem.

Design/methodology/approach: We propose a modified deep clustering method to detect research trends from the abstracts and titles of academic documents. Document embedding approaches transform documents into vector-based representations. The proposed method is evaluated on a benchmark dataset by comparing it with combinations of different embedding and clustering approaches and with classical topic modeling algorithms (i.e. LDA). A case study also explores the evolution of Artificial Intelligence (AI) by detecting the research topics or sub-fields in AI-related publications.

Findings: Clustering performance indicators show that the proposed method outperforms similar approaches on the benchmark dataset. Using the proposed method, we also show how the topics have evolved over the past 30 years, relying on a keyword extraction method for cluster tagging and labeling to convey the context of the topics.

Research limitations: We found that no single solution generalizes to all downstream tasks; solutions must be fine-tuned or optimized for each task and even for each dataset. In addition, the interpretation of cluster labels can be subjective and vary with readers' opinions. Labeling techniques are also very difficult to evaluate, which further limits the explanation of the clusters.

Practical implications: The case study shows, in a real-world example, how the proposed method enables researchers and reviewers of academic research to detect, summarize, analyze, and visualize research topics from decades of academic documents. This helps the scientific community and related organizations analyze fields quickly and effectively by establishing and explaining the topics.

Originality/value: We introduce a modified and tuned deep embedding clustering coupled with Doc2Vec representations for topic extraction. We also use a concept extraction method as a labeling approach. The effectiveness of the method is evaluated in a case study of AI publications, in which we analyze AI topics over the past three decades.
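The abstract names deep embedding clustering but gives no equations. As a minimal sketch, the block below assumes the standard (unmodified) deep embedded clustering formulation, not the authors' tuned variant: a Student's t soft assignment of embeddings to centroids, a sharpened target distribution, and a KL-divergence loss between the two. The 2-D points stand in for Doc2Vec vectors, and all names and values are hypothetical.

```python
import math

def soft_assign(embeddings, centroids, alpha=1.0):
    """Student's t similarity of each embedding to each centroid, normalized per row."""
    q = []
    for z in embeddings:
        sims = []
        for mu in centroids:
            d2 = sum((a - b) ** 2 for a, b in zip(z, mu))  # squared Euclidean distance
            sims.append((1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(sims)
        q.append([s / total for s in sims])
    return q

def target_distribution(q):
    """Sharpen q: square each entry, divide by soft cluster frequency, renormalize rows."""
    n_clusters = len(q[0])
    freq = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / freq[j] for j in range(n_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p

def kl_divergence(p, q):
    """KL(P || Q) summed over all documents and clusters."""
    return sum(
        pij * math.log(pij / qij)
        for p_row, q_row in zip(p, q)
        for pij, qij in zip(p_row, q_row)
        if pij > 0.0
    )

# Hypothetical toy data: four "document embeddings" and two cluster centroids.
embeddings = [(0.0, 0.1), (0.2, 0.0), (3.0, 3.1), (2.9, 3.0)]
centroids = [(0.0, 0.0), (3.0, 3.0)]

q = soft_assign(embeddings, centroids)
p = target_distribution(q)
labels = [max(range(len(row)), key=row.__getitem__) for row in q]
loss = kl_divergence(p, q)
```

In a full pipeline this loss would be minimized by updating both the encoder and the centroids; the sketch only evaluates it once on fixed inputs to show how the soft assignments and their sharpened targets relate.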

Highlights

Detection of research fields or topics in science and their dynamics over time is an active field of research.
This area falls under the Science of Science (SciSci), which aims to understand, quantify, and predict scientific research dynamics and their drivers in different forms, such as the birth and death of scientific fields and/or their sub-fields (Zeng et al., 2017), which can be identified by tracking changes in research trends.
The automation of this process, known as topic detection, gives scientists a faster way to discover, summarize, and represent research topics from a large corpus of documents, independent of subjective opinions, instead of reviewing thousands of documents.

Summary

Introduction

Detection of research fields or topics in science and their dynamics over time is an active field of research. This area falls under the Science of Science (SciSci), which aims to understand, quantify, and predict scientific research dynamics and their drivers in different forms, such as the birth and death of scientific fields and/or their sub-fields (Zeng et al., 2017), which can be identified by tracking changes in research trends. This helps governments, businesses, and scientists decide which fields of science to establish and invest in, and so informs research budgets. Following recent developments in machine learning and natural language processing (NLP), new text mining methods such as word and document embeddings (Joulin et al.; Mikolov et al., 2013) have made it easier to analyze the metadata or contents of publications to understand the dynamics of the fields (Zhang et al., 2017).

Objectives

Methods

Conclusion