Abstract

Although topic models have been used to cluster documents for more than ten years, choosing the optimal number of topics remains an open problem. The authors analyzed many fundamental studies on the subject from recent years. The main problem is the lack of a stable metric for the quality of the topics obtained during construction of a topic model. The authors analyzed the internal metrics of the topic model (coherence, contrast, and purity) as a means of determining the optimal number of topics and concluded that they are not applicable to this problem. The authors then analyzed an approach to choosing the optimal number of topics based on the quality of the clusters. For this purpose, they considered the behavior of the cluster validation metrics: the Davies-Bouldin index, the silhouette coefficient, and the Calinski-Harabasz index. The new method for determining the optimal number of topics proposed in this paper is based on the following principles: (1) setting up a topic model with additive regularization (ARTM) to separate noise topics; (2) using dense vector representations (GloVe, FastText, Word2Vec); (3) using cosine distance in the cluster metrics, which works better than Euclidean distance on high-dimensional vectors. The methodology developed by the authors for obtaining the optimal number of topics was tested on a collection of scientific articles from the OnePetro library, selected on specific themes. The experiment showed that the proposed method makes it possible to estimate the optimal number of topics for a topic model built on a small collection of English documents.
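As an illustration of the third principle above, the silhouette coefficient can be computed with cosine rather than Euclidean distance. The sketch below is our own minimal NumPy implementation for illustration (the function names and toy vectors are not from the paper):

```python
import numpy as np

def cosine_distance_matrix(X):
    # Pairwise cosine distances: 1 - (x . y) / (|x| |y|).
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - (X @ X.T) / (norms * norms.T)

def silhouette_cosine(X, labels):
    # Mean silhouette coefficient in [-1, 1] using cosine distance;
    # higher means better-separated clusters.
    D = cosine_distance_matrix(X)
    scores = []
    for i in range(len(labels)):
        own = (labels == labels[i])
        own[i] = False  # exclude the point itself
        if not own.any():
            scores.append(0.0)  # singleton cluster
            continue
        a = D[i, own].mean()  # mean intra-cluster distance
        b = min(D[i, labels == c].mean()
                for c in set(labels) if c != labels[i])  # nearest other cluster
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Toy document vectors with two clearly separated directions.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
good = silhouette_cosine(X, np.array([0, 0, 1, 1]))
bad = silhouette_cosine(X, np.array([0, 1, 0, 1]))
```

With document vectors obtained from GloVe, FastText, or Word2Vec, one would compute such a score for each candidate number of topics and select the number that maximizes it.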

Highlights

  • Topic models have been used successfully for clustering texts for many years

  • The metrics based on entropy make it possible to find a minimum as a function of the number of topics for large collections, but in practice small collections of documents are common

  • We show that the existing internal metrics of the topic model are not suitable for determining the optimal number of topics


Summary

Introduction

Topic models have been used successfully for clustering texts for many years. One of the most common approaches to topic modeling is Latent Dirichlet Allocation (LDA) [1]. To evaluate a complete set of topics, researchers usually look at the perplexity [9] of the corpus of documents. This approach does not work very well, according to the results of studies [10,11], because the perplexity does not have an absolute minimum and becomes asymptotic as the number of iterations increases [12]. Note that all three models described (LDA, HDP, hLDA) add a new set of parameters that require optimization, as noted in study [17]. Topic models derived from large collections of texts can be considered non-equilibrium complex systems in which the number of topics plays the role of temperature. This makes it possible to calculate the free energy of such systems, the value through which the Renyi and Tsallis entropies are expressed.
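The entropy-based approach mentioned above relies on the Renyi entropy. As a minimal sketch, the following computes it directly from a probability vector via its standard definition, H_q = ln(sum_i p_i^q) / (1 - q); the function name is ours, and this simplification omits the free-energy formulation used in the cited work:

```python
import numpy as np

def renyi_entropy(p, q):
    # Renyi entropy of order q (q != 1) for a probability vector p.
    # As q -> 1 it converges to the Shannon entropy.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # drop zero-probability entries (0^q contributes nothing)
    return float(np.log(np.sum(p ** q)) / (1.0 - q))

# For a uniform distribution over n outcomes, H_q = ln(n) for every q.
h = renyi_entropy([0.25, 0.25, 0.25, 0.25], 2.0)
```

In the cited approach, such an entropy is tracked as a function of the number of topics, and its minimum indicates the optimal topic count for large collections.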

Research Methodology
Experiment
Conclusions
