Abstract

In the digital world, research papers are growing exponentially with time, and it is essential to cluster documents under their respective categories for easier identification and access. However, researchers find it challenging to identify and categorize the research articles of interest to them. Though this task could be done manually, it would be tedious and extremely time-consuming. Hence, much research has been done in the field of topic modelling to yield accurate results with good computation time. The main objective of this paper is to compare two distinct yet widely used topic modelling approaches for research paper classification, which can further group research papers into their respective classes. The two chosen topic modelling methodologies are Non-Negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA). This paper compares the performance of the LDA model with that of the relatively efficient NMF model on a dataset of 1740 papers extracted from the NYC university website. In this comparison, the average coherence score for the LDA method was 0.5282 with an optimal choice of 22 topics, slightly higher than the NMF model, which yielded a coherence score of 0.4937 with 9 optimal topics. To enhance the categorization of LDA, the 22 optimal LDA topics were clustered into 10 using pyLDAvis. On closely comparing both models, LDA performs slightly better than NMF, with a higher coherence score.
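As a rough illustration of the comparison described above, the sketch below trains an LDA and an NMF model on the same corpus and scores both with the c_v coherence measure. This is a minimal example, not the paper's actual pipeline: the toy `tokenized_docs`, the number of topics, and the training passes are placeholder assumptions, and the paper's preprocessing and topic-number search are not reproduced here.

```python
# Minimal sketch: comparing LDA and NMF topic models by coherence score.
# Assumes pre-tokenized documents; model settings are illustrative only.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Nmf, CoherenceModel

# Hypothetical toy corpus standing in for the 1740 tokenized papers.
tokenized_docs = [
    ["topic", "modelling", "research", "paper", "classification"],
    ["matrix", "factorization", "topic", "model", "document"],
    ["dirichlet", "allocation", "document", "cluster", "paper"],
]

dictionary = Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

# Train both models on the same corpus so coherence scores are comparable.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=0)
nmf = Nmf(corpus=corpus, id2word=dictionary, num_topics=2,
          passes=10, random_state=0)

# Score each model with the c_v coherence measure.
for name, model in [("LDA", lda), ("NMF", nmf)]:
    coherence = CoherenceModel(
        model=model, texts=tokenized_docs, dictionary=dictionary,
        coherence="c_v",
    ).get_coherence()
    print(f"{name} coherence (c_v): {coherence:.4f}")
```

In practice, this scoring loop would be repeated over a range of topic counts to find the optimum for each model, which is how figures such as 22 topics for LDA and 9 for NMF would typically be obtained.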
