Abstract

Hierarchical topic models, such as hierarchical Latent Dirichlet Allocation (hLDA) and its variants, can organize topics into a hierarchy automatically. At the same time, many document collections come with hierarchical label information. Incorporating this information into the topic modeling process can help users obtain a more reasonable hierarchical structure. However, after analyzing various real-world datasets, we find that these hierarchical labels are ambiguous and conflicting at some levels, which introduces errors and restrictions into the exploration of latent topics and the hierarchical structure. We call this the horizontal topic expansion problem. To address it, in this paper we propose a novel hierarchical topic model named the horizontal and vertical hierarchical topic model (HV-HTM), which incorporates the observed hierarchical label information into the topic generation process while retaining the flexibility to expand the hierarchical structure both horizontally and vertically during modeling. We conduct experiments on the BBC news and Yahoo! Answers datasets and evaluate the effectiveness of HV-HTM on three evaluation metrics. The experimental results show that HV-HTM significantly improves topic modeling compared to state-of-the-art models, and that it also obtains a more interpretable hierarchical structure.

Highlights

  • Topic modeling is one of the most popular research areas in Natural Language Processing (NLP), which aims at discovering the latent topics in a large collection of documents

  • We focus on hierarchical topic models that incorporate observed hierarchical label information, and on how to expand the topic tree horizontally and vertically

  • The runtime and memory usage of hierarchical Latent Dirichlet Allocation (hLDA) increase dramatically, far more than those of Semi-Supervised Hierarchical Latent Dirichlet Allocation (SSHLDA) and the horizontal and vertical hierarchical topic model (HV-HTM). These results indicate that HV-HTM matches the running performance of SSHLDA and is much better than hLDA


Summary

Introduction

Topic modeling is one of the most popular research areas in Natural Language Processing (NLP), which aims at discovering the latent topics in a large collection of documents. Topic models, such as Latent Dirichlet Allocation (LDA) [1], have proven useful in extracting latent topics, but they organize topics in a flat structure. Hierarchical topic models, like hierarchical Latent Dirichlet Allocation (hLDA) [2], are proposed to relax this restriction. Those models make use of the Chinese restaurant process and
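The Chinese restaurant process mentioned above can be sketched as follows: a new customer joins an existing table with probability proportional to the number of customers already seated there, or opens a new table with probability proportional to a concentration parameter. The snippet below is an illustrative sketch of this sampling step, not the paper's model; the function name `crp_assign` and the parameter `gamma` are our own notation.

```python
import random

def crp_assign(counts, gamma=1.0):
    """Seat one new customer under the Chinese restaurant process.

    counts: number of customers at each existing table.
    gamma:  concentration parameter controlling new-table probability.
    Returns the index of the chosen table; len(counts) means a new table.
    """
    total = sum(counts) + gamma
    r = random.uniform(0.0, total)
    for k, n in enumerate(counts):
        if r < n:
            return k      # join existing table k (prob. counts[k] / total)
        r -= n
    return len(counts)    # open a new table (prob. gamma / total)

# Simulate seating 100 customers; the table counts grow rich-get-richer.
tables = []
for _ in range(100):
    k = crp_assign(tables, gamma=1.0)
    if k == len(tables):
        tables.append(1)  # new table with its first customer
    else:
        tables[k] += 1
```

In hLDA, a nested version of this process is used: each document samples a root-to-leaf path, applying a draw like the one above at every level of the tree, which is what lets the hierarchy grow without fixing its branching in advance.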

