Abstract

Topic modeling has emerged as a successful approach to uncovering topics from textual data. Various topic modeling techniques have been introduced, ranging from traditional algorithms to those based on neural networks. In this research, we explore advanced topic modeling techniques, including BERT-based approaches, to enhance the analysis of scientific articles. We first investigate a widely used Latent Dirichlet Allocation (LDA) model and then explore the capabilities of BERT, to automatically uncover latent topics within scientific papers. The goal of this study is to identify the optimal hyperparameter setting for BERT-based topic modeling of scientific articles. We conduct experiments across several scenarios involving combinations of word embedding, dimension reduction, and clustering methods. The results were analyzed based on the coherence values, average execution time, number of topics generated, visualization through the inter-topic distance map, and the top-N-words of each topic. Our findings suggest that combination of RoBERTa for word embedding, PCA for dimension reduction, and K-Means for clustering yields superior results among the tested scenarios. Further evaluation of BERT-based topic modeling is necessary to validate these findings and explore its applications in various academic and industrial contexts. The implications of these advanced techniques could significantly streamline the process of staying updated with scientific literature, potentially revolutionizing research methodologies across disciplines.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.