Hierarchical Dirichlet Multinomial Allocation Model for Multi-Source Document Clustering

Ruizhang Huang, Yongbin Qin, Yanping Chen, Weijia Xu

doi:10.1109/access.2020.3002107

Abstract

Mining a document structure from multiple data sources in terms of their underlying topics has become an important task of document clustering. The traditional document clustering approach cannot be applied directly to the multi-source document clustering problem. There are three typical difficulties: 1) The topics of different data sources are related but not the same. 2) Each data source usually has its own focus on topics. 3) The number of clusters is not necessarily the same across data sources and is not known beforehand. In this paper, based on our previous research, we design a novel multi-source document clustering model, namely, the hierarchical Dirichlet multinomial allocation (HDMA) model, to address these three difficulties. The HDMA model is built around a two-step hierarchical topic generation process. Topics are learnt so that they share general characteristics across data sources while preserving the local characteristics of each data source. Each data source is given its own topic partition to learn the source-level topic emphasis. A Gibbs sampling algorithm then learns the number of clusters for each data source and the parameters of the HDMA model at the same time. Experimental results demonstrate that the HDMA model is effective.

Highlights

As internet technology has rapidly developed, an increasing number of text documents has become available from various heterogeneous data sources
The hierarchical Dirichlet multinomial allocation (HDMA) model is designed based on the Dirichlet multinomial allocation (DMA) model, which is described in our previous work on the document clustering approach for a single data source [5], [13]
We evaluated the experimental performance of our proposed HDMA model on two real data corpora

Summary

INTRODUCTION

As internet technology has rapidly developed, an increasing number of text documents has become available from various heterogeneous data sources. An inappropriate value of K distorts the clustering process and results in poor document clustering performance. It is therefore useful for a multi-source document clustering approach to learn the number of clusters K for each individual data source automatically. The HDMA model is designed based on the Dirichlet multinomial allocation (DMA) model, which is described in our previous work on the document clustering approach for a single data source [5], [13]. A Gibbs sampling algorithm is developed to estimate the parameters of HDMA as well as the number of clusters K of each individual data source automatically. The remainder of this paper is organized as follows: Section II reviews the related work of multi-source document clustering.
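To make the idea of learning K by Gibbs sampling concrete, the following is a minimal sketch for a single data source. It is not the HDMA sampler from the paper: it is a generic collapsed Gibbs sweep for a Dirichlet-multinomial mixture with a Chinese-restaurant-process prior, in which a document may open a new cluster at any step. The function names (log_predictive, gibbs_sweep), the hyper-parameter names alpha and beta, and the toy corpus are all assumptions introduced for illustration.

```python
import math
import random
from collections import Counter


def log_predictive(doc, cluster_counts, cluster_total, vocab_size, beta):
    """Log probability of the word sequence `doc` given the words already
    in a cluster, under a symmetric Dirichlet(beta) prior over the
    vocabulary (sequential Dirichlet-multinomial predictive)."""
    logp = 0.0
    seen = Counter()  # words of `doc` consumed so far
    for i, w in enumerate(doc):
        num = cluster_counts.get(w, 0) + seen[w] + beta
        den = cluster_total + i + vocab_size * beta
        logp += math.log(num / den)
        seen[w] += 1
    return logp


def gibbs_sweep(docs, z, vocab_size, alpha, beta, rng):
    """One collapsed Gibbs sweep. z[i] is the cluster label of docs[i] and
    is updated in place. alpha (assumed name) controls how readily a new
    cluster is opened, so K is not fixed in advance."""
    sizes = Counter(z)   # label -> number of documents
    counts = {}          # label -> word counts
    for doc, k in zip(docs, z):
        counts.setdefault(k, Counter()).update(doc)

    for i, doc in enumerate(docs):
        # Remove document i from its current cluster.
        k_old = z[i]
        sizes[k_old] -= 1
        counts[k_old].subtract(doc)
        if sizes[k_old] == 0:
            del sizes[k_old]
            del counts[k_old]

        # Score every existing cluster, plus one brand-new cluster.
        labels = list(sizes)
        logw = [
            math.log(sizes[k])
            + log_predictive(doc, counts[k], sum(counts[k].values()),
                             vocab_size, beta)
            for k in labels
        ]
        labels.append(max(labels, default=-1) + 1)
        logw.append(math.log(alpha)
                    + log_predictive(doc, {}, 0, vocab_size, beta))

        # Sample a label proportionally to exp(logw), stabilised by the max.
        m = max(logw)
        weights = [math.exp(v - m) for v in logw]
        k_new = rng.choices(labels, weights=weights)[0]

        z[i] = k_new
        sizes[k_new] += 1
        counts.setdefault(k_new, Counter()).update(doc)
    return z


# Toy usage: documents are lists of word ids from a vocabulary of size 4.
toy_docs = [[0, 1, 0, 1], [1, 0, 0, 1], [2, 3, 3, 2], [3, 2, 2, 3]]
labels = [0, 0, 0, 0]            # start with everything in one cluster
rng = random.Random(0)
for _ in range(20):
    gibbs_sweep(toy_docs, labels, vocab_size=4, alpha=1.0, beta=0.1, rng=rng)
print("clusters in use:", len(set(labels)))
```

The number of distinct labels after the sweeps plays the role of K. A multi-source model such as HDMA would additionally tie the clusters of different sources together through shared global topics, which this single-source sketch does not attempt.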

RELATED WORK

BACKGROUND

ALGORITHM

DATASETS

EVALUATION METRICS

INFLUENCE OF HYPER-PARAMETERS

CONCLUSION

Journal: IEEE Access
Publication Date: Jan 1, 2020
Citations: 5
License type: CC BY 4.0
