Abstract

The expansion of the World Wide Web and the increasing popularity of microblogging websites such as Twitter and Facebook have created massive stores of short textual data. Although traditional topic models have proven successful on collections of long texts such as books and news articles, they tend to produce less coherent results when applied to short texts such as status messages and product reviews. Over the last few decades, analysing short texts has become increasingly relevant, as such bodies of text can hold useful information. Latent Dirichlet allocation (LDA), one of the most popular topic models, makes the generative assumption that each document contains multiple topics in varying proportions, which is a sensible assumption for long texts. In contrast, the Gibbs-sampling Dirichlet multinomial mixture model (GSDMM), a seemingly less popular topic model, assumes that each document belongs to a single topic, which appears to be a more appropriate assumption for short texts. The objective of this paper is to investigate the hypothesis that GSDMM will outperform LDA on short text, using topic coherence and stability as performance measures.
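The difference between the two generative assumptions can be illustrated with a toy sketch. The vocabulary, topics, and helper names below are hypothetical and not from the paper; the sketch only contrasts per-word topic draws (LDA-style) with a single per-document topic draw (DMM-style).

```python
import random

# Hypothetical toy topics: each topic is a small word list standing in
# for a word distribution.
TOPICS = {
    0: ["price", "shipping", "refund"],
    1: ["goal", "match", "league"],
}

def generate_lda_doc(n_words=6):
    """LDA-style: every word draws its own topic, so one document can mix topics."""
    return [random.choice(TOPICS[random.randint(0, 1)]) for _ in range(n_words)]

def generate_dmm_doc(n_words=6):
    """DMM-style: a single topic is drawn once, and all words come from it."""
    topic = random.randint(0, 1)
    return [random.choice(TOPICS[topic]) for _ in range(n_words)]
```

For a very short document (a tweet or status message), the single-topic assumption of `generate_dmm_doc` is plausible, whereas a long article is better described by the mixed draws of `generate_lda_doc`.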
