Abstract
Hierarchical topic models, such as hierarchical Latent Dirichlet Allocation (hLDA) and its variants, can organize topics into a hierarchy automatically. At the same time, many documents come with hierarchical label information, and incorporating this information into the topic modeling process can help users obtain a more reasonable hierarchical structure. However, after analyzing various real-world datasets, we find that these hierarchical labels are ambiguous and conflicting at some levels, which introduces errors and restrictions into the exploration of latent topics and the hierarchical structure; we call this the horizontal topic expansion problem. To address this problem, we propose a novel hierarchical topic model named the horizontal and vertical hierarchical topic model (HV-HTM), which incorporates the observed hierarchical label information into the topic generation process while preserving the flexibility to expand the hierarchical structure both horizontally and vertically during modeling. We conduct experiments on the BBC News and Yahoo! Answers datasets and evaluate the effectiveness of HV-HTM on three evaluation metrics. The experimental results show that HV-HTM significantly improves topic modeling compared to state-of-the-art models, and that it also yields a more interpretable hierarchical structure.
Highlights
Topic modeling is one of the most popular research areas in Natural Language Processing (NLP), and aims to discover the latent topics in a large collection of documents
We focus on hierarchical topic models that incorporate observed hierarchical label information, and on how to expand the topic tree both horizontally and vertically
The runtime and memory usage of hierarchical Latent Dirichlet Allocation (hLDA) grow dramatically, far more than those of Semi-Supervised Hierarchical Latent Dirichlet Allocation (SSHLDA) and the horizontal and vertical hierarchical topic model (HV-HTM). These results indicate that HV-HTM matches the running performance of SSHLDA and is much better than hLDA
Summary
Topic modeling is one of the most popular research areas in Natural Language Processing (NLP); it aims to discover the latent topics in a large collection of documents. Topic models such as Latent Dirichlet Allocation (LDA) [1] have proven useful for extracting latent topics, but they arrange topics in a flat structure. Hierarchical topic models, like hierarchical Latent Dirichlet Allocation (hLDA) [2], were proposed to relax this restriction. These models make use of the nested Chinese restaurant process as a prior over tree structures, allowing the topic hierarchy to be inferred from the data.
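To illustrate the mechanism the summary refers to, the following is a minimal toy sketch of path sampling under a nested Chinese restaurant process: at each level of the tree, a document follows an existing branch with probability proportional to how many documents chose it before, or opens a new branch with probability proportional to a concentration parameter gamma. This is an illustrative sketch only, not code from the paper; the function names and the `tree` data structure are my own assumptions.

```python
import random

def crp_choose(child_counts, gamma):
    """Pick an existing child with probability proportional to its count,
    or return "new" with probability proportional to gamma."""
    total = sum(child_counts.values()) + gamma
    r = random.uniform(0, total)
    for child, count in child_counts.items():
        r -= count
        if r <= 0:
            return child
    return "new"  # open a new branch (the horizontal-expansion case)

def sample_ncrp_path(tree, depth, gamma):
    """Sample a root-to-leaf path of `depth` levels through a nested CRP.

    `tree` maps a path tuple to a dict {child_id: customer_count},
    and is updated in place with the sampled document's choices."""
    path = ()
    for _ in range(depth):
        children = tree.get(path, {})
        choice = crp_choose(children, gamma)
        if choice == "new":
            choice = max(children, default=-1) + 1  # fresh child id
        children[choice] = children.get(choice, 0) + 1
        tree[path] = children
        path = path + (choice,)
    return path
```

Because popular branches attract more documents while gamma keeps a nonzero chance of opening new ones, the tree can both deepen (by fixing a larger `depth`) and widen as more documents are sampled, which is the flexibility HV-HTM aims to preserve while also respecting observed labels.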