A cross-collection mixture model for comparative text mining

Chengxiang Zhai, Bei Yu, Atulya Velivelli

doi:10.1145/1014052.1014150

Abstract

In this paper, we define and study a novel text mining problem, which we refer to as Comparative Text Mining (CTM). Given a set of comparable text collections, the task of comparative text mining is to discover any latent common themes across all collections and to summarize the similarities and differences of these collections along each common theme. This general problem subsumes many interesting applications, including business intelligence and opinion summarization. We propose a generative probabilistic mixture model for comparative text mining. The model simultaneously performs cross-collection clustering and within-collection clustering, and can be applied to an arbitrary set of comparable text collections. The model can be estimated efficiently using the Expectation-Maximization (EM) algorithm. We evaluate the model on two different text data sets (a news article data set and a laptop review data set), and compare it with a baseline clustering method that is also based on a mixture model. Experimental results show that the model is quite effective in discovering the latent common themes across collections and performs significantly better than our baseline mixture model.
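The abstract does not spell out the model, so the following is only a minimal sketch of how a cross-collection mixture of this kind could be fit with EM, not the authors' exact formulation. It assumes that each word is drawn either from a fixed background distribution (weight `lambda_b`) or from one of `k` themes, and that each theme mixes a common word distribution shared by all collections (weight `lambda_c`) with a collection-specific one. The function name, parameter names, default values, and the small smoothing constant are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def normalize(weights):
    """Scale a dict of positive weights so that the values sum to 1."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}


def fit_cross_collection_mixture(collections, k=2, lambda_b=0.9,
                                 lambda_c=0.5, iters=50, seed=0):
    """EM for an assumed cross-collection mixture (illustrative only).

    collections: list of collections; each is a list of token lists.
    Returns (common, specific, pi): common[j] is the shared word
    distribution of theme j, specific[j][i] its variant for collection i,
    and pi[d][j] the weight of theme j in document d.
    """
    rng = random.Random(seed)
    docs = [(i, Counter(doc))
            for i, coll in enumerate(collections) for doc in coll]
    vocab = sorted({w for _, counts in docs for w in counts})
    m = len(collections)

    # Assumption: the background model is the pooled word frequency, kept fixed.
    pooled = Counter()
    for _, counts in docs:
        pooled.update(counts)
    background = normalize(pooled)

    def random_dist():
        return normalize({w: rng.random() + 0.1 for w in vocab})

    common = [random_dist() for _ in range(k)]
    specific = [[random_dist() for _ in range(m)] for _ in range(k)]
    pi = [[1.0 / k] * k for _ in docs]

    eps = 1e-12  # tiny smoothing so no probability collapses to exactly zero
    for _ in range(iters):
        new_common = [dict.fromkeys(vocab, eps) for _ in range(k)]
        new_specific = [[dict.fromkeys(vocab, eps) for _ in range(m)]
                        for _ in range(k)]
        new_pi = [[eps] * k for _ in docs]
        for d, (i, counts) in enumerate(docs):
            for w, c in counts.items():
                # E-step: posteriors over background/theme and common/specific.
                mix = [lambda_c * common[j][w]
                       + (1 - lambda_c) * specific[j][i][w] for j in range(k)]
                themed = [pi[d][j] * mix[j] for j in range(k)]
                theme_total = sum(themed)
                p_theme = ((1 - lambda_b) * theme_total
                           / (lambda_b * background[w]
                              + (1 - lambda_b) * theme_total))
                for j in range(k):
                    resp = c * p_theme * themed[j] / theme_total
                    share_common = lambda_c * common[j][w] / mix[j]
                    # M-step accumulators (expected counts).
                    new_pi[d][j] += resp
                    new_common[j][w] += resp * share_common
                    new_specific[j][i][w] += resp * (1 - share_common)
        common = [normalize(t) for t in new_common]
        specific = [[normalize(t) for t in row] for row in new_specific]
        pi = [[v / sum(row) for v in row] for row in new_pi]
    return common, specific, pi


# Toy, made-up data: two "collections" of two tiny documents each.
toy = [
    [["battery", "life", "long", "brand_a"], ["screen", "bright", "brand_a"]],
    [["battery", "life", "short", "brand_b"], ["screen", "dim", "brand_b"]],
]
common, specific, pi = fit_cross_collection_mixture(toy, k=2, iters=20)
```

In this sketch, sorting `common[j]` by probability would list the words that characterize theme `j` across all collections, while `specific[j][i]` would show how collection `i` talks about the same theme. The baseline mentioned in the abstract is not specified here; under the assumptions above, dropping the collection-specific component (setting `lambda_c=1.0`) reduces the sketch to a plain mixture model over the pooled collections.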
