Abstract

Topic modeling is a popular approach to clustering text documents. However, current tools suffer from a number of unsolved problems, such as instability and the lack of criteria for selecting model parameter values. In this work, we propose a method that partially solves the problem of optimizing model parameters while simultaneously accounting for semantic stability. Our method is inspired by concepts from statistical physics and is based on Sharma–Mittal entropy. We test our approach on two models, probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA) with Gibbs sampling, and on two datasets in different languages. We compare our approach against a number of standard metrics, each of which can account for only one of the parameters of interest. We demonstrate that Sharma–Mittal entropy is a convenient tool for selecting both the number of topics and the values of hyper-parameters while simultaneously controlling for semantic stability, which none of the existing metrics can do. Furthermore, we show that concepts from statistical physics can contribute to theory construction for machine learning, a rapidly developing field that currently lacks a consistent theoretical foundation.
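
For reference, the Sharma–Mittal entropy named above is the standard two-parameter generalization of Shannon entropy. For a discrete distribution P = (p_1, ..., p_n) and parameters q ≠ 1, r ≠ 1 it reads:

```latex
H_{q,r}(P) = \frac{1}{1-r}\left[\left(\sum_{i=1}^{n} p_i^{\,q}\right)^{\frac{1-r}{1-q}} - 1\right]
```

In the limit r → 1 it recovers the Rényi entropy, for r = q the Tsallis entropy, and for q, r → 1 the Shannon entropy, which is what makes it a convenient single dial for entropy-based model selection; how the estimator is applied to topic-model output is specified in the body of the paper.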

Highlights

  • The Internet and social networks generate huge amounts of data of different types

  • We demonstrate the results of simulations run on two datasets using two Topic Modeling (TM) algorithms: probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA) with Gibbs sampling

  • We propose a method for tuning topic models based on the following assertions [29,50], which link TM to statistical physics and reformulate the problem of model parameter optimization in thermodynamic terms: (1) a collection of documents is considered a mesoscopic information system, i.e., a statistical system where the elements are words and the documents number in the millions (a minimal tuning sketch follows this list)
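
To make the tuning idea concrete, here is a minimal, hypothetical sketch in Python: it scans candidate topic numbers, fits an LDA model for each, and scores the model by the Sharma–Mittal entropy of the flattened topic-word distribution. scikit-learn's variational LDA stands in for the Gibbs-sampling implementation studied in the paper, and the toy corpus, the entropy estimator, and the parameter choices q = 2, r = 0.5 are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def sharma_mittal_entropy(p: np.ndarray, q: float, r: float) -> float:
    """Sharma-Mittal entropy of a discrete distribution (requires q != 1, r != 1).

    Recovers Renyi entropy as r -> 1 and Tsallis entropy when r = q.
    """
    p = p[p > 0]                               # drop zero-probability entries
    s = float(np.sum(p ** q))                  # generalized moment sum
    return (s ** ((1.0 - r) / (1.0 - q)) - 1.0) / (1.0 - r)

# Hypothetical toy corpus; the paper instead uses two real datasets
# in different languages.
docs = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "stock markets fell sharply today",
    "investors worry about market volatility",
]

X = CountVectorizer().fit_transform(docs)

# Scan candidate topic numbers; extrema of the entropy curve
# flag candidate models.
for n_topics in range(2, 6):
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(X)
    # Row-normalize components_ to obtain the topic-word matrix phi.
    phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    # One distribution over all (topic, word) cells, summing to 1.
    p = phi.ravel() / n_topics
    h = sharma_mittal_entropy(p, q=2.0, r=0.5)
    print(f"T={n_topics}: Sharma-Mittal entropy = {h:.4f}")
```

In an entropy-based scheme of this kind, the entropy curve over the number of topics guides model selection; the paper additionally ties the entropy values to hyper-parameter choice and semantic stability.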

Introduction

The Internet and social networks generate huge amounts of data of different types (such as images, texts, or tabular data). The amount of collected data has become comparable to the size of physical mesoscopic systems, which makes it possible to analyze such data with machine learning methods grounded in statistical physics. Topic Modeling (TM) is a popular machine learning approach to soft clustering of textual or visual data; its purpose is to identify the set of hidden distributions in texts or images and to sort the data according to these distributions. A relatively large number of probabilistic topic models with different methods of determining hidden distributions have been developed, and several metrics for measuring the quality of topic modeling results have been formulated and investigated. The lion's share of research on TM has focused on probabilistic models [1] such as variants of Latent Dirichlet Allocation (LDA) and probabilistic Latent Semantic Analysis (pLSA).
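
As a quick illustration of the soft-clustering view (a toy example assumed here, not taken from the paper), the snippet below fits a two-topic LDA model and prints each document's topic distribution; because every document receives a full distribution over topics rather than a single label, a mixed document can belong to several topics at once.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-theme toy corpus (sports vs. finance).
docs = [
    "the goalkeeper saved a penalty in the cup final",
    "the team won the league after a late goal",
    "the central bank raised interest rates again",
    "inflation and interest rates worry the markets",
    "fans debate whether rate hikes hurt the football club finances",
]

X = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# transform() returns a (documents x topics) matrix of mixture weights;
# each row is a probability distribution over topics (soft clustering).
for doc, weights in zip(docs, lda.transform(X)):
    print(f"{weights.round(2)}  {doc}")
```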
