Abstract

Though word embeddings and topics are complementary representations, several past works have used pretrained word embeddings in (neural) topic modeling only to address data sparsity in short-text or small collections of documents. This work presents a novel neural topic modeling framework using multi-view embedding spaces: (1) pretrained topic embeddings, and (2) pretrained word embeddings (context-insensitive from GloVe and context-sensitive from BERT models), jointly from one or many sources, to improve topic quality and better deal with polysemy. In doing so, we first build respective pools of pretrained topic embeddings (i.e., TopicPool) and word embeddings (i.e., WordPool). We then identify one or more relevant source domain(s) and transfer knowledge to guide meaningful learning in the sparse target domain. Within neural topic modeling, we quantify the quality of topics and document representations via generalization (perplexity), interpretability (topic coherence) and information retrieval (IR) using short-text, long-text, small and large document collections from the news and medical domains. Introducing the multi-source multi-view embedding spaces, we show state-of-the-art neural topic modeling using 6 source (high-resource) and 5 target (low-resource) corpora.
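To make the multi-source, multi-view idea concrete, here is a minimal sketch (not the authors' implementation) of how a WordPool of pretrained word embeddings and a TopicPool of pretrained topic matrices from several source corpora might be assembled and queried for a target-domain vocabulary. The class names, the averaging across sources, and the assumption that all word vectors share one dimensionality are illustrative choices, not details from the paper.

    import numpy as np

    class WordPool:
        """Illustrative pool of pretrained word embeddings from several source corpora."""
        def __init__(self):
            self.sources = {}  # source name -> {word: vector}

        def add_source(self, name, word_vectors):
            self.sources[name] = word_vectors

        def lookup(self, word):
            # Gather the word's vectors from every source that knows it and average them
            # (assumes all sources use the same embedding dimensionality).
            vectors = [vecs[word] for vecs in self.sources.values() if word in vecs]
            return np.mean(vectors, axis=0) if vectors else None

    class TopicPool:
        """Illustrative pool of pretrained topic-word matrices (topics x vocabulary) per source."""
        def __init__(self):
            self.sources = {}  # source name -> (topic_word_matrix, vocabulary)

        def add_source(self, name, topic_word_matrix, vocabulary):
            self.sources[name] = (topic_word_matrix, vocabulary)

    # Hypothetical usage: pool GloVe-style and BERT-derived vectors for a target-domain word.
    rng = np.random.default_rng(0)
    pool = WordPool()
    pool.add_source("glove_news", {"doctor": rng.standard_normal(300)})
    pool.add_source("bert_pubmed", {"doctor": rng.standard_normal(300)})
    print(pool.lookup("doctor").shape)  # (300,)

In the paper's terms, the pooled word vectors supply the (multi-source) word-embedding view, while the pooled topic-word matrices supply the topic-embedding view transferred to the sparse target domain.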

Highlights

  • Probabilistic topic models, such as LDA (Blei et al., 2003), Replicated Softmax (RSM) (Salakhutdinov and Hinton, 2009) and the Document Neural Autoregressive Distribution Estimator (DocNADE) (Larochelle and Lauly, 2012), are often used to extract topics from text collections and learn latent document representations to perform natural language processing tasks, such as information retrieval (IR).

  • No prior work in topic modeling has employed multi-view embedding spaces: (1) pretrained topics, i.e., topical embeddings obtained from large document collections, and (2) pretrained contextualized word embeddings from large-scale language models like BERT (Devlin et al., 2019).

  • Word embeddings have a primarily local view in the sense that they are learned from local collocation patterns in a text corpus, where the representation of each word often depends on a local context window (Mikolov et al., 2013b) or is a function of its sentence(s) (Peters et al., 2018).


Summary

Introduction

Probabilistic topic models, such as LDA (Blei et al., 2003), Replicated Softmax (RSM) (Salakhutdinov and Hinton, 2009) and the Document Neural Autoregressive Distribution Estimator (DocNADE) (Larochelle and Lauly, 2012), are often used to extract topics from text collections and learn latent document representations to perform natural language processing tasks, such as information retrieval (IR). Word embeddings have a primarily local view in the sense that they are learned from local collocation patterns in a text corpus, where the representation of each word often depends on a local context window (Mikolov et al., 2013b) or is a function of its sentence(s) (Peters et al., 2018). They are not aware of the thematic structures underlying the document collection. We evaluate the effectiveness of multi-source neural topic modeling in multi-view embedding spaces using 7 (5 low-resource and 2 high-resource) target and 5 (high-resource) source corpora from the news and medical domains, consisting of short-text, long-text, small and large document collections.

[Figure: embedding lookups — columns: word embeddings; rows: topic embeddings; visible units v_1, ..., v_{i-1}, v_i, ..., v_D with v ∈ {1, ..., K}^D.]
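As a rough illustration of the autoregressive view sketched in the figure above, the following is a minimal DocNADE-style computation in which each column of the embedding-lookup matrix W acts as a word embedding and the hidden state for position i depends only on the preceding visible units v_{<i}. The variable names and toy dimensions are assumptions made for illustration, not the authors' code.

    import numpy as np

    def docnade_hidden_states(doc, W, c):
        """
        Sketch of DocNADE-style autoregressive hidden states.
        doc : word indices v_1..v_D (visible units, each in {0, ..., K-1})
        W   : lookup matrix of shape (hidden_dim, vocab_size); column W[:, v] is the
              embedding of word v, which multi-view variants can enrich with
              pretrained word and topic embeddings.
        c   : hidden bias of shape (hidden_dim,)
        """
        running_sum = np.zeros_like(c)
        hiddens = []
        for v_i in doc:
            # h_i is computed from the embeddings of the words preceding position i only.
            hiddens.append(1.0 / (1.0 + np.exp(-(c + running_sum))))  # sigmoid nonlinearity
            running_sum += W[:, v_i]
        return np.array(hiddens)

    # Toy example: vocabulary of 1000 word types, 50 hidden topics, a 4-word document.
    K, H = 1000, 50
    W = np.random.randn(H, K) * 0.01
    c = np.zeros(H)
    print(docnade_hidden_states([3, 42, 7, 999], W, c).shape)  # (4, 50)

Each hidden state h_i can then feed a softmax over the vocabulary to predict v_i, which is how DocNADE-style models obtain document probabilities (perplexity) and latent document representations for IR.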

Knowledge-Aware Topic Modeling
Neural Autoregressive Topic Models
Overall loss with controlled topic-imitation
MVT and MST in Neural Topic Modeling
Evaluation and Analysis
Generalization
Interpretability
Applicability
A Data Description
C Experimental Setup
Experimental Setup for Generalization
Experimental Setup for IR Task
Reproducibility