Abstract

Over the past two decades, advances in topic modeling have produced sophisticated models capable of generating topic hierarchies. In particular, hierarchical Latent Dirichlet Allocation (hLDA) builds a topic tree based on the nested Chinese Restaurant Process (nCRP) or other sampling processes, generating a topic hierarchy that allows arbitrarily large branch structures and adaptive dataset growth. In addition, hierarchical topic models based on the latent tree model, such as Hierarchical Latent Tree Analysis (HLTA), have been developed over the last five years. However, these models do not work well on corpora with millions of documents and hundreds of thousands of terms. Moreover, the topic trees generated by these models are always poorly interpretable, and the relationships among topics at different levels are relatively simple. The biomedical literature, including Medline abstracts, is a large-scale corpus whose documents fall into two major categories: biological laboratory research and medical clinical research. We propose a top-down binary hierarchical topic model (biHTM) for biomedical literature that iteratively applies a flat topic model and adaptively processes subtrees of the hierarchy. The biHTM topic hierarchy of the complete Medline abstracts, with more than 14 levels of topic nodes, shows good bimodality and interpretability. Compared to hLDA and HLTA, biHTM shows promising results in experiments assessing runtime and quality.

Highlights

  • In the last two decades, the biomedical literature has grown exponentially, which has created an enormous challenge for life science researchers and healthcare professionals attempting to stay up to date with developments in their field [1]

  • We propose a top-down binary hierarchical topic model for the biomedical literature by iteratively applying a flat topic model, and adaptively processing the subtrees of the hierarchy

  • To efficiently mine large-scale medical literature corpora, we proposed an adaptive top-down binary hierarchical topic model, called biHTM, and the resulting biHTM topic tree of Medline abstracts supported our conjecture about the binary characteristics of the biomedical literature

Summary

INTRODUCTION

In the last two decades, the biomedical literature has grown exponentially, which has created an enormous challenge for life science researchers and healthcare professionals attempting to stay up to date with developments in their field [1].

Lin et al.: A Top-down Binary Hierarchical Topic Model for Biomedical Literature

Sampling-based hierarchical models such as hLDA need to maintain a full topic hierarchy in each sampling iteration, and it usually takes at least 100 iterations to yield a stable hLDA topic tree. We propose a top-down binary hierarchical topic model (biHTM) for the biomedical literature that iteratively applies a flat topic model (such as LDA) and adaptively processes the subtrees of the hierarchy. Unlike other probabilistic generative methods, this is a heuristic generative method: it can quickly generate the topic hierarchy from top to bottom without using latent variables.
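The top-down idea described above (split a node's documents in two with a flat model, then recurse into each subtree) can be illustrated with a minimal sketch. This is not the authors' implementation. The names `TopicNode`, `toy_flat_split`, and `build_tree` are hypothetical, the stopping rule (`max_depth`, `min_docs`) is an assumed placeholder for the paper's adaptive subtree processing, and the splitter is a toy two-cluster term-overlap heuristic standing in for a real flat topic model such as LDA.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Doc = List[str]  # a tokenized document


@dataclass
class TopicNode:
    """One node of the binary topic tree (hypothetical structure)."""
    docs: List[Doc]
    depth: int
    top_terms: List[str]
    children: List["TopicNode"] = field(default_factory=list)


def top_terms(docs: List[Doc], k: int = 3) -> List[str]:
    """Most frequent terms; ties are broken alphabetically for determinism."""
    counts = Counter(t for d in docs for t in d)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def overlap(doc: Doc, profile: Counter) -> float:
    """Share of a cluster's term mass that is covered by the document's terms."""
    total = sum(profile.values()) or 1
    return sum(profile[t] for t in doc) / total


def toy_flat_split(docs: List[Doc], rounds: int = 5) -> Tuple[List[Doc], List[Doc]]:
    """Toy stand-in for a flat two-topic model (NOT LDA). Needs >= 2 documents."""
    seed_a = docs[0]
    # Second seed: the document sharing the fewest terms with the first one.
    seed_b = min(docs[1:], key=lambda d: len(set(d) & set(seed_a)))
    profiles = [Counter(seed_a), Counter(seed_b)]
    labels: List[int] = []
    for _ in range(rounds):
        new_labels = [
            0 if overlap(d, profiles[0]) >= overlap(d, profiles[1]) else 1
            for d in docs
        ]
        if new_labels == labels:  # assignments are stable, stop early
            break
        labels = new_labels
        profiles = [
            Counter(t for d, lab in zip(docs, labels) if lab == side for t in d)
            for side in (0, 1)
        ]
    left = [d for d, lab in zip(docs, labels) if lab == 0]
    right = [d for d, lab in zip(docs, labels) if lab == 1]
    return left, right


def build_tree(
    docs: List[Doc],
    split: Callable[[List[Doc]], Tuple[List[Doc], List[Doc]]] = toy_flat_split,
    depth: int = 0,
    max_depth: int = 3,   # assumed stopping rule, not from the paper
    min_docs: int = 2,    # assumed stopping rule, not from the paper
) -> TopicNode:
    """Recursively split documents in two, from the root downwards."""
    node = TopicNode(docs=docs, depth=depth, top_terms=top_terms(docs))
    if depth >= max_depth or len(docs) < max(min_docs, 2):
        return node
    left, right = split(docs)
    if not left or not right:  # degenerate split: keep the node as a leaf
        return node
    node.children = [
        build_tree(part, split, depth + 1, max_depth, min_docs)
        for part in (left, right)
    ]
    return node


# Tiny made-up corpus: three "laboratory" and three "clinical" documents.
corpus = [
    ["gene", "protein", "cell"],
    ["protein", "cell", "mouse"],
    ["gene", "cell", "expression"],
    ["patient", "trial", "therapy"],
    ["patient", "therapy", "outcome"],
    ["trial", "patient", "cohort"],
]
tree = build_tree(corpus, max_depth=1)
for child in tree.children:
    print(child.depth, len(child.docs), child.top_terms)
```

Because `split` is passed in as a parameter, the toy heuristic could in principle be swapped for a real two-topic flat model that assigns each document to its dominant topic. Each node is built only from the documents routed to it, so the full hierarchy never has to be held in a sampler state, which is the property the summary contrasts with hLDA.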

RELATED WORK
LEARNING PROCESS
COMPARISON
INTERPRETABILITY
CONCLUSION AND FUTURE WORK
