Abstract

In recent years, with the rapid growth of social media, short texts have become very prevalent on the internet. Due to the limited length of each short text, word co-occurrence information in this type of document is sparse, so conventional topic models based on word co-occurrence are unable to distill coherent topics from short texts. A state-of-the-art strategy is self-aggregated topic models, which implicitly aggregate short texts into latent long documents. These models have two problems. First, the number of long documents must be defined explicitly, and an inappropriate number leads to poor performance. Second, latent long documents may introduce non-semantic word co-occurrence, which yields incoherent topics. In this article, we first apply the Chinese restaurant process to automatically determine the number of long documents according to the scale of the short-text corpus. Then, to exclude non-semantic word co-occurrence, we propose a novel probabilistic model that generates latent long documents in a more semantic way. Specifically, our model employs a Pitman-Yor process to aggregate short texts into long documents. This stochastic process guarantees that the distribution of short texts over long documents follows a power law, as observed in social media such as Twitter. Finally, we compare our method with several state-of-the-art methods on four real-world short-text corpora. The experimental results show that our model outperforms the other methods on the metrics of topic coherence and text classification.
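The aggregation step described above can be made concrete with a small simulation. The sketch below implements Pitman-Yor (Chinese-restaurant) seating, assigning short texts to latent long documents one at a time: the number of long documents is not fixed in advance, and the resulting document sizes are heavy-tailed. This is only an illustration of the prior, not the paper's PYSTM inference procedure; the function name and parameter values are hypothetical.

```python
import random


def pitman_yor_assign(n_texts, discount=0.5, concentration=1.0, seed=0):
    """Assign short texts to latent long documents via Pitman-Yor seating.

    P(new long doc)       = (concentration + discount * K) / (concentration + n)
    P(existing long doc k) = (count_k - discount)          / (concentration + n)
    where K is the current number of long docs and n the texts seen so far.
    """
    rng = random.Random(seed)
    counts = []        # counts[k] = number of short texts in long document k
    assignments = []   # assignments[i] = long document index for short text i
    for i in range(n_texts):
        if i == 0:
            p_new = 1.0  # the first text always opens a new long document
        else:
            p_new = (concentration + discount * len(counts)) / (concentration + i)
        if rng.random() < p_new:
            counts.append(1)
            assignments.append(len(counts) - 1)
        else:
            # choose an existing long doc with prob proportional to (count - discount)
            weights = [c - discount for c in counts]
            r = rng.random() * sum(weights)
            acc = 0.0
            for k, w in enumerate(weights):
                acc += w
                if r <= acc:
                    counts[k] += 1
                    assignments.append(k)
                    break
    return assignments, counts


assignments, counts = pitman_yor_assign(5000)
sizes = sorted(counts, reverse=True)
print(len(counts))   # number of long documents grows with corpus size
print(sizes[:5])     # a few large long documents, many small ones
```

The discount parameter controls the tail: with `discount > 0`, the number of long documents grows as a power of the corpus size, giving the power-law behavior the abstract refers to.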

Highlights

  • Short texts have become prevalent on the internet, such as titles, comments, microblogs, questions, etc.

  • We propose the Pitman-Yor Process Self-aggregated Topic Model (PYSTM), customized for short texts.

  • State-of-the-art self-aggregated topic models need to explicitly define the number of long documents.


Introduction

Short texts have become prevalent on the internet, such as titles, comments, microblogs, questions, etc. As they play an important role in our daily life, discovering knowledge from short texts has become important and challenging work. Traditional topic models like LDA [2] and HDP [3] perform well on normal-length texts, but on short texts they yield poor performance [4]. To infer topics, these models rely on word co-occurrence information, yet only little such information can be found due to the very short length of each text [5]. Many researchers have aimed to overcome this problem [6].
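The sparsity of co-occurrence information can be illustrated with a toy count of within-document word pairs. The example below (an illustration, not from the paper) compares the same six words observed as one normal-length document versus three short texts: splitting the text sharply reduces the number of distinct co-occurring pairs a topic model can exploit.

```python
from itertools import combinations


def cooccurrence_pairs(docs):
    """Collect distinct word pairs that co-occur within the same document."""
    pairs = set()
    for doc in docs:
        words = sorted(set(doc.split()))
        pairs.update(combinations(words, 2))
    return pairs


# The same 6 words, as one normal-length document vs. three short texts.
long_doc = ["topic model infer word cooccurrence corpus"]
short_texts = ["topic model", "infer word", "cooccurrence corpus"]

print(len(cooccurrence_pairs(long_doc)))     # 15 pairs, i.e. C(6, 2)
print(len(cooccurrence_pairs(short_texts)))  # only 3 pairs survive
```

This is exactly why aggregating short texts into longer pseudo-documents restores the co-occurrence signal that word-co-occurrence-based topic models depend on.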
