Abstract

Mining a document structure from multiple data sources in terms of their underlying topics has become an important task of document clustering. The traditional document clustering approach cannot be applied directly to the multi-source document clustering problem. There are three typical difficulties: 1) The topics of different data sources are related but not the same. 2) Usually, each data source has its own focus on topics. 3) The numbers of clusters in the data sources are not necessarily the same and are not known beforehand. In this paper, based on our previous research, we design a novel multi-source document clustering model, namely, the hierarchical Dirichlet multinomial allocation (HDMA) model, to address these three difficulties. The HDMA model is formulated with a two-step hierarchical topic generation process. Topics are learnt so that they share general characteristics across data sources while preserving the local characteristics of each data source. Each data source is assigned an exclusive topic partition to learn the source-level topic emphasis. A Gibbs sampling algorithm is then used to learn the number of clusters for each data source and the parameters of the HDMA model simultaneously. Experimental results demonstrate that the HDMA model is effective.
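As a rough illustration of the two-level idea described above (topics shared across data sources, with each source placing its own emphasis on them), the following minimal sketch generates a toy multi-source corpus. It is not the HDMA model itself: the function name `generate_corpus`, the fixed number of topics, the symmetric Dirichlet priors `alpha` and `beta`, and the one-topic-per-document assumption are all illustrative choices that do not come from the paper, which learns the number of clusters per source rather than fixing it.

```python
import numpy as np

def generate_corpus(n_sources=2, n_topics=3, vocab_size=20,
                    docs_per_source=5, doc_len=30,
                    alpha=1.0, beta=0.5, seed=0):
    """Toy two-level generator (hypothetical, fixed number of topics)."""
    rng = np.random.default_rng(seed)
    # Global level: topic-word distributions shared by every data source.
    topic_word = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    corpus = []
    for source in range(n_sources):
        # Source level: each source draws its own weights over the shared
        # topics, standing in for the source-level topic emphasis.
        source_weights = rng.dirichlet(np.full(n_topics, alpha))
        for _ in range(docs_per_source):
            # Assumption: each document belongs to exactly one topic (cluster).
            topic = int(rng.choice(n_topics, p=source_weights))
            # Word counts of the document are drawn from that topic.
            counts = rng.multinomial(doc_len, topic_word[topic])
            corpus.append((source, topic, counts))
    return topic_word, corpus

topic_word, corpus = generate_corpus()
for source, topic, counts in corpus[:3]:
    print(source, topic, counts.sum())
```

Each tuple in `corpus` holds the source index, the latent topic label, and a bag-of-words count vector. A clustering method would see only the source index and the counts and would have to recover the labels.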

Highlights

  • As internet technology has rapidly developed, an increasing number of text documents have become available from various heterogeneous data sources

  • The hierarchical Dirichlet multinomial allocation (HDMA) model is designed based on the Dirichlet multinomial allocation (DMA) model, which is described in our previous work on the document clustering approach for a single data source [5], [13]

  • We evaluated the experimental performance of our proposed HDMA model on two real data corpora

Summary

INTRODUCTION

As internet technology has rapidly developed, an increasing number of text documents have become available from various heterogeneous data sources. An inappropriate value for the number of clusters K distorts the clustering process and results in poor document clustering performance. It is therefore useful if a multi-source document clustering approach is able to learn K for each individual data source automatically. The HDMA model is designed based on the Dirichlet multinomial allocation (DMA) model, which is described in our previous work on the document clustering approach for a single data source [5], [13]. A Gibbs sampling algorithm is developed to estimate the parameters of HDMA as well as the number of clusters K of each individual data source automatically. The remainder of this paper is organized as follows: Section II reviews the related work of multi-source document clustering.

RELATED WORK
BACKGROUND
ALGORITHM
DATASETS
EVALUATION METRICS
INFLUENCE OF HYPER-PARAMETERS
CONCLUSION