Investigating the Efficient Use of Word Embedding with Neural-Topic Models for Interpretable Topics from Short Texts.

Riki Murakami,Basabi Chakraborty

doi:10.3390/s22030852

Abstract

With the rapid proliferation of social networking sites (SNS), automatic topic extraction from various text messages posted on SNS are becoming an important source of information for understanding current social trends or needs. Latent Dirichlet Allocation (LDA), a probabilistic generative model, is one of the popular topic models in the area of Natural Language Processing (NLP) and has been widely used in information retrieval, topic extraction, and document analysis. Unlike long texts from formal documents, messages on SNS are generally short. Traditional topic models such as LDA or pLSA (probabilistic latent semantic analysis) suffer performance degradation for short-text analysis due to a lack of word co-occurrence information in each short text. To cope with this problem, various techniques are evolving for interpretable topic modeling for short texts, pretrained word embedding with an external corpus combined with topic models is one of them. Due to recent developments of deep neural networks (DNN) and deep generative models, neural-topic models (NTM) are emerging to achieve flexibility and high performance in topic modeling. However, there are very few research works on neural-topic models with pretrained word embedding for generating high-quality topics from short texts. In this work, in addition to pretrained word embedding, a fine-tuning stage with an original corpus is proposed for training neural-topic models in order to generate semantically coherent, corpus-specific topics. An extensive study with eight neural-topic models has been completed to check the effectiveness of additional fine-tuning and pretrained word embedding in generating interpretable topics by simulation experiments with several benchmark datasets. The extracted topics are evaluated by different metrics of topic coherence and topic diversity. We have also studied the performance of the models in classification and clustering tasks. Our study concludes that though auxiliary word embedding with a large external corpus improves the topic coherency of short texts, an additional fine-tuning stage is needed for generating more corpus-specific topics from short-text data.

Highlights

Due to the rapid developments of computing and communication technologies and the widespread use of internet, people are gradually becoming accustomed to communicating through various online social platforms, such as microblogs, Twitter, webpages, Facebook, etc
We found that pretrained word embedding enhances the topic coherence of short texts that are similar to long and formal texts, the generated topics were often comprised of words having common meanings instead of the particular short-text-specific semantics of the word, which is especially important for real-world datasets
The simulation experiments have been performed with several benchmark datasets, and the performance of the topic models are evaluated by topic coherence and topic diversity measures

Summary

Introduction

Due to the rapid developments of computing and communication technologies and the widespread use of internet, people are gradually becoming accustomed to communicating through various online social platforms, such as microblogs, Twitter, webpages, Facebook, etc. In the area of traditional natural language processing, a topicmodeling algorithm is considered an effective technique for the semantic understanding of text documents Conventional topic models, such as pLSA [1] or LDA [2] and their various variants, are considerably good at extracting latent semantic structures from a text corpus without prior annotations and are widely used in emerging topic detection, document classification, comment summarizing, or event tracking. In these models, documents are viewed as a mixture of topics, while each topic is viewed as a particular distribution over all the words. The efficient capture of document-level word co-occurrence patterns leads to the success of topic modeling

Objectives

Results

Conclusion

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Journal: Sensors	Publication Date: Jan 23, 2022
Citations: 14	License type: CC BY 4.0

R Discovery Prime

R Discovery Prime

Investigating the Efficient Use of Word Embedding with Neural-Topic Models for Interpretable Topics from Short Texts.

Abstract

Highlights

Summary

Talk to us

Similar Papers

More From: Sensors

Lead the way for us

Similar Papers

Use of Neural Topic Models in conjunction with Word Embeddings to extract meaningful topics from short texts
Nassera Habbat ... Houda Anoun
EAI Endorsed Transactions on Internet of Things | VOL. 8
Nassera Habbat, et. al.Nassera Habbat ... Houda Anoun
30 Sep 2022
EAI Endorsed Transactions on Internet of Things | VOL. 8

Extracting Topics with Simultaneous Word Co-occurrence and Semantic Correlation Graphs: Neural Topic Modeling for Short Texts
...
-
, et. al. ...
23 Oct 2021
23 Oct 2021

Extracting Topics with Simultaneous Word Co-occurrence and Semantic Correlation Graphs: Neural Topic Modeling for Short Texts
Yiming Wang ... Xiaotang Zhou
-
Yiming Wang, et. al.Yiming Wang ... Xiaotang Zhou
01 Jan 2020
01 Jan 2020

GLTM: A Global and Local Word Embedding-Based Topic Model for Short Texts
Wenxin Liang ... Ran Feng
IEEE Access | VOL. 6
Wenxin Liang, et. al.Wenxin Liang ... Ran Feng
01 Jan 2018
IEEE Access | VOL. 6

Editage

Paperpal

R Discovery

Mind the Graph

R Discovery Prime

R Discovery Prime

Investigating the Efficient Use of Word Embedding with Neural-Topic Models for Interpretable Topics from Short Texts.

Abstract

Highlights

Summary

Talk to us

Similar Papers

More From: Sensors