Abstract

In the Humanities and Social Sciences, there is increasing interest in approaches to information extraction, prediction, intelligent linkage, and dimension reduction applicable to large text corpora. With approaches in these fields grounded in traditional statistical techniques, the need arises for frameworks through which advanced NLP techniques such as topic modelling may be incorporated within classical methodologies. This paper provides a classical, supervised, statistical learning framework for prediction from text, using topic models as a data reduction method and the topics themselves as predictors, alongside typical statistical tools for predictive modelling. We apply this framework in a Social Sciences context (applied animal behaviour) and a Humanities context (narrative analysis). The results show that topic regression models perform comparably to their much less efficient equivalents that use individual words as predictors.
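
As a concrete illustration of the framework, the sketch below fits a topic model as a dimension reduction step and then regresses a numeric response on the resulting topic proportions. It is a minimal sketch assuming scikit-learn's LDA implementation rather than the models developed in the paper; the toy corpus, response values, and topic count are illustrative placeholders.

```python
# Minimal sketch of the topic-regression framework, assuming scikit-learn's
# LDA in place of the paper's own models. The corpus, responses, and number
# of topics below are illustrative placeholders only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LinearRegression

documents = [
    "the dog chased the ball in the park",
    "stock prices fell sharply after the report",
    "the cat slept near the warm window",
    "interest rates rose and markets reacted",
]
responses = [3.2, 1.5, 2.8, 1.1]  # hypothetical numeric response per document

# Step 1: dimension reduction -- each document becomes a vector of topic proportions.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(documents)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(counts)  # documents x topics matrix of proportions

# Step 2: classical predictive modelling -- the topics themselves are the predictors.
model = LinearRegression().fit(theta, responses)
```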

Highlights

  • For the past 20 years, topic models have been used as a means of dimension reduction on text data, in order to ascertain underlying themes, or ‘topics’, from documents

  • This paper develops a methodology for incorporating topic models into traditional statistical regression frameworks, such as those used in the Social Sciences and Humanities, to make predictions

  • We derive an efficient likelihood-based method for estimating topic proportions for previously unseen documents, without the need to retrain (a sketch of this inference step follows these highlights)

  • Given that these two models hold the ‘bag of words’ assumption, we investigate the effect of introducing language structure to the model through the hidden Markov topic model (HMTM) (Andrews and Vigliocco, 2010)
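
Continuing the sketch above, the inference step highlighted here can be mimicked as follows: a new document is scored against the already-fitted topic-word distributions, so only its topic proportions are estimated and no retraining occurs. Note that scikit-learn's variational inference is used as a stand-in; it plays the same role as, but is not identical to, the likelihood-based method derived in the paper.

```python
# Infer topic proportions for a previously unseen document without refitting.
# The fitted topic-word distributions stay fixed; only the new document's
# proportions are estimated (here via scikit-learn's variational inference,
# standing in for the paper's likelihood-based estimator).
new_doc = ["the dog slept in the park"]
new_counts = vectorizer.transform(new_doc)  # reuse the fitted vocabulary
new_theta = lda.transform(new_counts)       # topic proportions, no retraining
prediction = model.predict(new_theta)       # plug straight into the regression model
```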

Introduction

For the past 20 years, topic models have been used as a means of dimension reduction on text data, in order to ascertain underlying themes, or ‘topics’, from documents. This paper develops a methodology for incorporating topic models into traditional statistical regression frameworks, such as those used in the Social Sciences and Humanities, to make predictions. We derive an efficient likelihood-based method for estimating topic proportions for previously unseen documents, without the need to retrain. Given that these two models hold the ‘bag of words’ assumption (i.e., they assume independence between words in a document), we investigate the effect of introducing language structure to the model through the hidden Markov topic model (HMTM) (Andrews and Vigliocco, 2010). The implementation of these three topic models as a dimension reduction step for a regression model provides a framework for the implementation of further topic models, dependent on the needs of the corpus and response in question.
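
Since the bag-of-words models above ignore word order, the toy simulation below, written under assumed parameter values, illustrates what the HMTM adds: each word's topic is drawn conditionally on the previous word's topic via a Markov transition matrix, rather than independently. This is a generative sketch of the model's assumption, not the paper's estimation procedure.

```python
# Toy simulation of the HMTM generative assumption: the topic of each word
# depends on the topic of the preceding word through a Markov transition
# matrix. All parameter values here are arbitrary toy assumptions.
import numpy as np

rng = np.random.default_rng(0)

K, V, doc_len = 3, 8, 12                   # topics, vocabulary size, document length
pi = np.full(K, 1.0 / K)                   # initial topic distribution
A = rng.dirichlet(np.ones(K), size=K)      # K x K topic-to-topic transition matrix
beta = rng.dirichlet(np.ones(V), size=K)   # K x V topic-word distributions

topics, words = [], []
z = rng.choice(K, p=pi)                    # topic of the first word
for _ in range(doc_len):
    topics.append(z)
    words.append(rng.choice(V, p=beta[z])) # word drawn from the current topic
    z = rng.choice(K, p=A[z])              # next topic depends on the current one
```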

Definitions
LDA regression model
Regression model and number of topics
Introducing new documents
HMTM regression model
Testing the topic regression models
Word count model
Topic regression models
Incorporating language structure
Discussion and further research
Text cleaning
