Abstract

Topic models are unsupervised probabilistic models used to explore the hidden semantic structure of a corpus of documents by generating latent topics. Latent topics are discrete distributions over words that must be interpreted and labeled by humans. They can also be viewed as a summary of the subjects discussed in the corpus, so that a topic model can additionally serve as a dimension-reduction method. The quality of latent topics generated by topic models is difficult to measure [4], and their outputs are commonly used descriptively to gain insight into the topic distributions of chosen documents. A so far unexplored approach is the use of labeled data to assess the informational content of latent topics. In this paper, topic models are examined in terms of their informative value for classification problems on labeled text data. We use geo-coded social media posts from Twitter, but our approach can be extended to other labeled documents. The outputs of Latent Dirichlet Allocation (LDA) [3] models and Structural Topic Models (STM) [29] are used as input to machine learning classifiers, after pooling tweets with the hashtag pooling algorithm of [24]. Their predictive power is compared with the performance of state-of-the-art Artificial Neural Networks (ANNs) trained on a specifically optimized word embedding and all available tweet metadata. We find that the machine learning classifiers trained on topics can compete with the predictive performance of the ANNs, even for out-of-sample predicted topic distributions.
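The pipeline described above — pooling tweets by hashtag, fitting a topic model on the pooled pseudo-documents, and feeding per-tweet topic distributions to a classifier — can be sketched as follows. This is a minimal illustration using scikit-learn's LDA implementation on toy data; the corpus, labels, and hyperparameters are illustrative stand-ins, not the paper's Twitter data or its tuned models.

```python
# Hedged sketch of the pipeline: hashtag pooling -> LDA -> classifier.
# The tweets, labels, and hyperparameters below are invented for illustration.
from collections import defaultdict

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

# Toy corpus: (hashtag, text, label); the label plays the role of the
# class to be predicted (in the paper, derived from geo-coded tweets).
tweets = [
    ("#sports", "great goal in the match tonight", 0),
    ("#sports", "the team won the championship game", 0),
    ("#politics", "parliament passed the new budget law", 1),
    ("#politics", "the minister announced a policy reform", 1),
]

# Step 1: hashtag pooling -- concatenate tweets sharing a hashtag into one
# pseudo-document, so the topic model sees longer, less sparse texts.
pools = defaultdict(list)
for tag, text, _ in tweets:
    pools[tag].append(text)
pooled_docs = [" ".join(texts) for texts in pools.values()]

# Step 2: fit LDA on the pooled pseudo-documents.
vec = CountVectorizer()
X_pooled = vec.fit_transform(pooled_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X_pooled)

# Step 3: infer topic distributions for the individual tweets and use them
# as low-dimensional features for a supervised classifier.
X_tweets = vec.transform([text for _, text, _ in tweets])
theta = lda.transform(X_tweets)  # per-tweet topic proportions
y = [label for _, _, label in tweets]
clf = LogisticRegression().fit(theta, y)
```

The same three-step structure applies with STM in place of LDA, or with any classifier in place of logistic regression; the key point is that the topic proportions `theta` act as a learned dimension reduction of the raw text.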
