Abstract

Advances in deep generative models have attracted significant research interest in neural topic modeling. The recently proposed Adversarial-neural Topic Model models topics with an adversarially trained generator network and employs a Dirichlet prior to capture the semantic patterns in latent topics. It is effective in discovering coherent topics but unable to infer topic distributions for given documents or to utilize available document labels. To overcome these limitations, we propose Topic Modeling with Cycle-consistent Adversarial Training (ToMCAT) and its supervised version sToMCAT. ToMCAT employs a generator network to interpret topics and an encoder network to infer document topics. Adversarial training and cycle-consistent constraints are used to encourage the generator and the encoder to produce realistic samples that coordinate with each other. sToMCAT extends ToMCAT by incorporating document labels into the topic modeling process to help discover more coherent topics. The effectiveness of the proposed models is evaluated on unsupervised/supervised topic modeling and text classification. The experimental results show that our models can produce both coherent and informative topics, outperforming a number of competitive baselines.
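
To make the cycle-consistency idea concrete, here is a minimal PyTorch sketch of such a loss term, assuming an encoder that maps bag-of-words document vectors to topic distributions and a generator that maps topic vectors back to word distributions. The function name, the L1 penalty, and the Dirichlet sampling call are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of a cycle-consistency term (assumed names and losses,
# not the authors' code).
import torch
import torch.nn.functional as F

def cycle_consistency_loss(encoder, generator, doc_repr, topic_sample):
    """encoder: maps (batch, vocab) bag-of-words vectors to topic distributions.
    generator: maps (batch, num_topics) topic vectors back to word distributions.
    topic_sample: topic vectors drawn from a Dirichlet prior."""
    # Document cycle: document -> inferred topics -> reconstructed document.
    doc_cycle = generator(encoder(doc_repr))
    # Topic cycle: sampled topics -> generated document -> recovered topics.
    topic_cycle = encoder(generator(topic_sample))
    # L1 penalties push the two networks to act as near-inverses of each other.
    return F.l1_loss(doc_cycle, doc_repr) + F.l1_loss(topic_cycle, topic_sample)

# Example: draw 32 topic vectors over 50 topics from a symmetric Dirichlet prior.
# topic_sample = torch.distributions.Dirichlet(torch.ones(50)).sample((32,))
```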

Highlights

  • Topic models, such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003), aim to discover underlying topics and semantic structures from text collections.

  • Due to its interpretability and effectiveness, LDA has been extended to many Natural Language Processing (NLP) tasks (Lin and He, 2009; McAuley and Leskovec, 2013; Zhou et al., 2017).

  • A document labeled as ‘sports’ is more likely associated with topics such as ‘basketball’ or ‘football’ than with ‘economics’ or ‘politics’. To address such limitations of the Adversarial-neural Topic Model (ATM), we propose a novel neural topic modeling approach, named Topic Modeling with Cycle-consistent Adversarial Training (ToMCAT).


Summary

Introduction

Topic models, such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003), aim to discover underlying topics and semantic structures from text collections. Inspired by the variational autoencoder (VAE) (Kingma and Welling, 2013), Miao et al. (2016) proposed the Neural Variational Document Model, which interprets the latent code in a VAE as topics. Following this line, Srivastava and Sutton (2017) adopted a logistic normal prior rather than a Gaussian one to mimic the simplex property of topic distributions. Although ATM was shown to be effective in discovering coherent topics, it cannot be used to induce the topic distribution of a given document due to the absence of a topic inference module. This limitation hinders its application to downstream tasks, such as text classification. Moreover, ATM cannot exploit available document labels, which often carry useful topical signals: a document labeled as ‘sports’ is more likely associated with topics such as ‘basketball’ or ‘football’ than with ‘economics’ or ‘politics’. To address these limitations of ATM, we propose a novel neural topic modeling approach, named Topic Modeling with Cycle-consistent Adversarial Training (ToMCAT), together with its supervised version sToMCAT. Experimental results on unsupervised/supervised topic modeling and text classification demonstrate the effectiveness of the proposed approaches.
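
For context, the logistic normal trick mentioned above can be sketched in a few lines: a reparameterized Gaussian sample is pushed through a softmax so that the result lies on the probability simplex. This is a generic sketch of the reparameterization, not Srivastava and Sutton's exact model.

```python
# Generic logistic-normal reparameterization sketch (not the paper's full model).
import torch

def sample_topic_distribution(mu, log_sigma):
    """Reparameterized draw: a Gaussian sample pushed through softmax yields
    a valid topic distribution (non-negative entries summing to 1)."""
    eps = torch.randn_like(mu)
    z = mu + torch.exp(log_sigma) * eps  # Gaussian reparameterization trick
    return torch.softmax(z, dim=-1)      # map onto the topic simplex
```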

Neural Topic Modeling
Unsupervised Style Transfer
Methodology
ToMCAT
Encoder Network E
Generator Network G
Training Objective
Training Details
Experimental Setup
Topic Modeling
Unsupervised Topic Modeling
Supervised Topic Modeling
Impact of Topic Numbers
Text Classification
Conclusion
Findings
A Discovered Topics on NYTimes