Abstract

Recently, statistical topic modeling approaches based on LDA have been applied to supervised document classification, where the model generation procedure incorporates prior knowledge to improve classification performance. However, these customizations of topic modeling are limited by the cumbersome derivation of a specific inference algorithm for each modification. In this paper, we propose a new supervised topic modeling approach for document classification, Neural Labeled LDA (NL-LDA), which builds on the VAE framework and designs a special generative network to incorporate prior label information. The proposed model supports semi-supervised learning based on the manifold assumption and the low-density assumption, and it uses a consistent and concise inference procedure for both semi-supervised learning and prediction. Quantitative experiments demonstrate that our model outperforms the compared approaches on supervised document classification, including traditional statistical and neural topic models. In particular, the proposed model supports both single-label and multi-label document classification. NL-LDA also performs well on semi-supervised classification, especially when only a small amount of labeled data is available. Further comparisons with related work indicate that our model is competitive with state-of-the-art topic modeling approaches on semi-supervised classification.
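
To make the architecture described above more concrete, the following is a minimal sketch (in PyTorch) of a VAE-style topic model with a label-aware objective. The logistic-normal approximation, the classifier head, the layer sizes, and all names (NeuralLabeledTopicModel, labeled_elbo_loss, etc.) are illustrative assumptions for exposition, not the authors' exact generative network.

    # Minimal sketch of a VAE-based neural topic model with label supervision,
    # in the spirit of what the abstract describes. All design details here are
    # assumptions, not the paper's exact NL-LDA architecture.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NeuralLabeledTopicModel(nn.Module):
        def __init__(self, vocab_size, num_topics, num_labels, hidden=256):
            super().__init__()
            # Inference (recognition) network: bag-of-words -> Gaussian over topic space
            self.encoder = nn.Sequential(
                nn.Linear(vocab_size, hidden), nn.Softplus(),
                nn.Linear(hidden, hidden), nn.Softplus(),
            )
            self.mu = nn.Linear(hidden, num_topics)
            self.logvar = nn.Linear(hidden, num_topics)
            # Generative network: topic proportions -> word distribution
            self.decoder = nn.Linear(num_topics, vocab_size, bias=False)
            # Classifier head: topic proportions -> label distribution
            self.classifier = nn.Linear(num_topics, num_labels)

        def forward(self, bow):
            h = self.encoder(bow)
            mu, logvar = self.mu(h), self.logvar(h)
            # Reparameterization trick (logistic-normal approximation of the Dirichlet)
            z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
            theta = F.softmax(z, dim=-1)           # document-topic proportions
            word_logits = self.decoder(theta)      # reconstruction of the document
            label_logits = self.classifier(theta)  # supervised signal
            return word_logits, label_logits, mu, logvar

    def labeled_elbo_loss(word_logits, label_logits, bow, labels, mu, logvar, mask):
        # Reconstruction term: expected log-likelihood of the observed words
        recon = -(bow * F.log_softmax(word_logits, dim=-1)).sum(-1)
        # KL divergence to a standard-normal prior over the latent topic vector
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)
        # Label term applied only to labeled documents (mask = 1)
        sup = mask * F.cross_entropy(label_logits, labels, reduction="none")
        return (recon + kl + sup).mean()

For unlabeled documents the mask zeroes out the classification term, so the same objective covers labeled and unlabeled data alike; this is the sense in which a single, consistent inference procedure can serve semi-supervised learning and prediction.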

Highlights

  • Statistical topic modeling approaches (Blei, 2012), e.g., Latent Dirichlet Allocation (LDA) (Blei et al., 2003), have been widely applied in the fields of data mining, latent data discovery, and document classification (Jelodar et al., 2018)

  • We propose a novel topic model, i.e., Neural Labeled LDA (NL-LDA), which is an extension of SLDA for semi-supervised document classification

  • Our model performs well in terms of correct classification rate (CCR) (Table 3), achieving the best scores among all compared algorithms, including the traditional statistical topic models Dependency-LDA and TL-LDA, as well as the neural topic model SCHOLAR

Introduction

Statistical topic modeling approaches (Blei, 2012), e.g., Latent Dirichlet Allocation (LDA) (Blei et al., 2003), have been widely applied in the fields of data mining, latent data discovery, and document classification (Jelodar et al., 2018). Standard LDA is a completely unsupervised algorithm, and how to incorporate prior knowledge into the topic modeling procedure is a popular research direction (Burkhardt and Kramer, 2019b; Chen et al., 2019). For standard LDA, the popular inference methods include variational inference (Blei et al., 2003), collapsed Gibbs sampling (Griffiths and Steyvers, 2004), and collapsed variational Bayes (Teh et al., 2006). All these methods share a common drawback: the inference algorithm must be re-derived even when only a small change is made to the modeling procedure.
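
For reference, the generative process of standard LDA that these inference methods target can be written as follows (a standard textbook formulation, not quoted from this paper):

    \begin{aligned}
    \beta_k &\sim \mathrm{Dirichlet}(\eta), && k = 1,\dots,K && \text{(topic-word distributions)}\\
    \theta_d &\sim \mathrm{Dirichlet}(\alpha), && d = 1,\dots,D && \text{(document-topic proportions)}\\
    z_{dn} &\sim \mathrm{Multinomial}(\theta_d), && n = 1,\dots,N_d && \text{(per-word topic assignments)}\\
    w_{dn} &\sim \mathrm{Multinomial}(\beta_{z_{dn}}) && && \text{(observed words)}
    \end{aligned}

Any modification to this process, for example conditioning \theta_d on observed document labels as in Labeled LDA, changes the joint distribution, so the conjugacy-based update equations used by Gibbs sampling or variational inference have to be re-derived. Amortized, VAE-style inference avoids this, since the same stochastic-gradient objective is simply re-optimized for the modified model.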
