Abstract

An innovative model-based approach to coupling text clustering and topic modeling is introduced, in which the two tasks take advantage of each other. Specifically, the integration is enabled by a new generative model of text corpora. This explains topics, clusters and document content via a Bayesian generative process. In this process, documents include word vectors, to capture the (syntactic and semantic) regularities among words. Topics are multivariate Gaussian distributions on word vectors. Clusters are assigned corresponding topic distributions as their semantics. Content generation is ruled by text clusters and topics, which act as interacting latent factors. Documents are at first placed into respective clusters, then the semantics of these clusters is then repeatedly sampled to draw document topics, which are in turn sampled for word-vector generation.Under the proposed model, collapsed Gibbs sampling is derived mathematically and implemented algorithmically with parameter estimation for the simultaneous inference of text clusters and topics.A comparative assessment on real-world benchmark corpora demonstrates the effectiveness of this approach in clustering texts and uncovering their semantics. Intrinsic and extrinsic criteria are adopted to investigate its topic modeling performance, whose results are shown through a case study. Time efficiency and scalability are also studied.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.