Abstract
Mining document structure from multiple data sources in terms of their underlying topics has become an important task in document clustering. Traditional document clustering approaches cannot be applied directly to the multi-source document clustering problem, for three typical reasons: 1) The topics of different data sources are related but not the same. 2) Each data source usually has its own focus on topics. 3) The numbers of clusters in the data sources are not necessarily the same and are not known beforehand. In this paper, building on our previous research, we design a novel multi-source document clustering model, the hierarchical Dirichlet multinomial allocation (HDMA) model, to address all of the above problems. The HDMA model uses a two-step hierarchical topic generation process. Topics are learnt so that they share general characteristics across data sources while preserving the local characteristics of each data source. An exclusive topic partition is applied to each data source to learn the source-level topic emphasis. A Gibbs sampling algorithm is then used to learn the number of clusters for each data source and the parameters of the HDMA model simultaneously. Experimental results demonstrate that the HDMA model is effective.
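To make the idea of a two-step hierarchical generation process more concrete, the following is a minimal illustrative sketch, not the authors' HDMA specification. It assumes a fixed number of topics (the paper instead learns the number of clusters per source), a single topic per document, and source-level topic weights drawn from a Dirichlet centred on shared global weights. The function name `generate_corpus` and all hyperparameters (`alpha`, `gamma`, `beta`) are hypothetical choices made for illustration only.

```python
import numpy as np


def generate_corpus(n_sources=2, docs_per_source=5, n_topics=4,
                    vocab_size=20, doc_len=30,
                    alpha=1.0, gamma=5.0, beta=0.5, seed=0):
    """Toy two-level generative process (an assumption, not the paper's model)."""
    rng = np.random.default_rng(seed)

    # Step 1 (global level): topic weights and topic-word distributions
    # shared by every data source.
    global_weights = rng.dirichlet(np.full(n_topics, alpha))
    topic_word = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)

    corpus = []
    for source in range(n_sources):
        # Step 2 (source level): each source re-weights the shared topics,
        # so sources are related but emphasise different topics.
        # The small constant keeps every Dirichlet parameter strictly positive.
        source_weights = rng.dirichlet(gamma * global_weights + 1e-6)

        for _ in range(docs_per_source):
            # One topic (cluster) per document, then a bag of words
            # drawn from that topic's multinomial over the vocabulary.
            topic = rng.choice(n_topics, p=source_weights)
            counts = rng.multinomial(doc_len, topic_word[topic])
            corpus.append((source, int(topic), counts))

    return global_weights, topic_word, corpus


if __name__ == "__main__":
    weights, topics, docs = generate_corpus()
    print("global topic weights:", np.round(weights, 3))
    print("first document (source, topic, word counts):", docs[0])
```

In this sketch, a larger `gamma` keeps each source close to the shared topic weights, while a smaller `gamma` lets sources diverge more strongly. Inference in the paper goes in the opposite direction: a Gibbs sampler recovers the cluster assignments, the number of clusters and the model parameters from observed documents.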
Highlights
As internet technology has rapidly developed, an increasing number of text documents have become available from various heterogeneous data sources
The hierarchical Dirichlet multinomial allocation (HDMA) model is designed based on the Dirichlet multinomial allocation (DMA) model, which is described in our previous work on the document clustering approach for a single data source [5], [13]
We evaluated the experimental performance of our proposed HDMA model on two real data corpora
Summary
As internet technology has rapidly developed, an increasing number of text documents have become available from various heterogeneous data sources. An inappropriate value of K distorts the clustering process and results in poor document clustering performance. It is therefore useful if a multi-source document clustering approach is able to learn the number of clusters K for each individual data source automatically. The HDMA model is designed based on the Dirichlet multinomial allocation (DMA) model, which is described in our previous work on the document clustering approach for a single data source [5], [13]. A Gibbs sampling algorithm is developed to estimate the parameters of HDMA as well as the number of clusters K of each individual data source automatically. The remainder of this paper is organized as follows: Section II reviews the related work of multi-source document clustering.