How Document Sampling and Vocabulary Pruning Affect the Results of Topic Models

Daniel Maier, Gregor Wiedemann, Daniela Stoltenberg, Andreas Niekler

doi:10.5117/ccr2020.2.001.maie

Abstract

Topic modeling enables researchers to explore large document corpora. Large corpora, however, can be extremely costly to model in terms of time and computing resources. To circumvent this problem, two techniques have been suggested: (1) modeling random document samples, and (2) pruning the vocabulary of the corpus. Although both techniques are frequently applied, there has been no systematic inquiry into how they affect the resulting models. Using three empirical corpora with different characteristics (news articles, websites, and Tweets), we systematically investigated how different sample sizes and pruning affect the resulting topic models in comparison to models of the full corpora. Our inquiry provides evidence that both techniques are viable tools that will likely not impair the resulting model. Sample-based topic models closely resemble corpus-based models if the sample size is large enough (> 10,000 documents). Moreover, extensive pruning does not compromise the quality of the resultant topics.
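The two techniques named in the abstract can be illustrated with a minimal sketch. It assumes simple random sampling without replacement and pruning by relative document frequency; the function names, the thresholds, and the toy corpus are hypothetical choices for illustration and are not taken from the paper.

```python
import random
from collections import Counter


def sample_documents(docs, sample_size, seed=0):
    """Draw a simple random sample of documents without replacement."""
    rng = random.Random(seed)
    return rng.sample(docs, min(sample_size, len(docs)))


def prune_vocabulary(tokenized_docs, min_doc_frac=0.01, max_doc_frac=0.99):
    """Drop terms whose relative document frequency lies outside the bounds.

    The default thresholds are illustrative assumptions, not the paper's settings.
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        # Count each term at most once per document.
        doc_freq.update(set(tokens))
    keep = {
        term
        for term, df in doc_freq.items()
        if min_doc_frac <= df / n_docs <= max_doc_frac
    }
    pruned = [[t for t in tokens if t in keep] for tokens in tokenized_docs]
    return pruned, keep


# Toy corpus, whitespace-tokenized for brevity.
corpus = [
    "topic models explore large corpora",
    "sampling reduces the cost of topic models",
    "pruning removes rare terms from the vocabulary",
    "large corpora are costly to model",
]
tokenized = [doc.split() for doc in corpus]

sample = sample_documents(tokenized, sample_size=3, seed=42)
pruned_docs, vocabulary = prune_vocabulary(sample, min_doc_frac=0.5, max_doc_frac=1.0)
```

The pruned sample would then be passed to a topic model in place of the full corpus. Fixing the seed keeps the sample reproducible, which matters when sample-based models are compared against a model of the full corpus.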

Journal: Computational Communication Research
Publication Date: Oct 1, 2020
Citations: 17
License type: cc-by
