Abstract

In recent years, the field of Topic Modeling (TM) has grown in importance due to the increasing availability of digital text data. TM is an unsupervised learning technique that uncovers latent semantic structures in large document collections, making it a valuable tool for finding relevant patterns. However, evaluating the performance of TM algorithms can be challenging because different metrics and datasets are often used, leading to inconsistent results. In addition, many current surveys of TM algorithms focus on a limited number of models and exclude state-of-the-art approaches. This paper addresses these issues by presenting a comprehensive comparative study of five TM algorithms across three benchmark datasets using five evaluation metrics. We offer an updated survey of the latest TM approaches and evaluation metrics, providing a consistent framework for comparing algorithms while introducing state-of-the-art approaches that have been overlooked in the literature. The experiments, which primarily use Context Vectors (CV) Topic Coherence as an evaluation metric, show that Top2Vec is the best-performing model across all datasets, disrupting the tendency for Latent Dirichlet Allocation to be the best performer.

DOI: 10.28991/ESJ-2024-08-01-09
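
The abstract's evaluation hinges on CV Topic Coherence computed over the top words of each discovered topic. The following is a minimal sketch of how such a score can be obtained with gensim's CoherenceModel; the toy corpus and topic word lists are illustrative assumptions, not the benchmark datasets or model outputs used in the study.

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# Toy tokenized corpus standing in for a benchmark dataset (assumption).
texts = [
    ["topic", "modeling", "uncovers", "latent", "semantic", "structure"],
    ["latent", "dirichlet", "allocation", "is", "a", "probabilistic", "model"],
    ["top2vec", "embeds", "documents", "and", "words", "jointly"],
]
dictionary = Dictionary(texts)

# Top words per topic, as produced by any TM algorithm (LDA, Top2Vec, ...).
topics = [
    ["topic", "modeling", "latent", "semantic"],
    ["dirichlet", "allocation", "probabilistic", "model"],
]

# C_V coherence: sliding-window co-occurrence statistics combined with
# NPMI-based context vectors and cosine similarity.
cm = CoherenceModel(topics=topics, texts=texts,
                    dictionary=dictionary, coherence="c_v")
print(cm.get_coherence())            # mean coherence across topics
print(cm.get_coherence_per_topic())  # one score per topic
```

Under this kind of setup, each model's topic lists can be scored on the same corpora, giving the consistent comparison framework the paper argues for.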
