Combining semantic and term frequency similarities for text clustering

Victor Hugo Andrade Soares, Murilo Coelho Naldi, Seyednaser Nourashrafeddin, Evangelos Milios, Ricardo J G B Campello

doi:10.1007/s10115-018-1278-7

Abstract

A key challenge in document clustering is finding a similarity measure for text documents that yields cohesive groups. Measures based on the classic bag-of-words model consider only the presence (and frequency) of words in documents. As a result, semantically similar documents that use different vocabularies may end up in different clusters. For this reason, semantic similarity measures that use external knowledge, such as word n-gram corpora or thesauri, have been proposed in the literature. In this paper, the Frequency Google Tri-gram Measure is proposed to assess the similarity between documents based on the frequencies of terms in the compared documents, with the Google n-gram corpus as an additional source of semantic similarity. Clustering algorithms are applied to several real datasets to experimentally evaluate the quality of the clusters obtained with the proposed measure and to compare it with a number of state-of-the-art measures from the literature. Statistical tests on the experimental results indicate that the proposed measure significantly improves the quality of document clustering. We further demonstrate that clustering results combining bag-of-words and semantic similarity are superior to those obtained with either approach independently.
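The vocabulary-mismatch problem described in the abstract can be illustrated with a small sketch. This is not the Frequency Google Tri-gram Measure itself, whose definition is not given on this page; it is a generic soft-cosine illustration in which the `RELATEDNESS` table, the function names, and the example documents are all hypothetical. The table stands in for an external semantic source such as an n-gram corpus.

```python
import math
from collections import Counter

# Hypothetical word-relatedness scores in [0, 1]. In practice such scores
# would come from an external resource (e.g. an n-gram corpus or a thesaurus);
# the values below are made up for illustration only.
RELATEDNESS = {
    frozenset(("car", "automobile")): 0.9,
    frozenset(("fast", "quick")): 0.8,
}


def word_sim(a, b):
    """Return 1.0 for identical words, a table value for known pairs, else 0.0."""
    if a == b:
        return 1.0
    return RELATEDNESS.get(frozenset((a, b)), 0.0)


def bow_cosine(doc_a, doc_b):
    """Plain bag-of-words cosine similarity over raw term frequencies."""
    tf_a, tf_b = Counter(doc_a), Counter(doc_b)
    dot = sum(tf_a[w] * tf_b[w] for w in tf_a)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def soft_cosine(doc_a, doc_b):
    """Soft cosine: term frequencies combined with pairwise word relatedness."""
    tf_a, tf_b = Counter(doc_a), Counter(doc_b)

    def bilinear(x, y):
        # Sum of tf(u) * tf(v) * sim(u, v) over all word pairs (u, v).
        return sum(x[u] * y[v] * word_sim(u, v) for u in x for v in y)

    denom = math.sqrt(bilinear(tf_a, tf_a) * bilinear(tf_b, tf_b))
    return bilinear(tf_a, tf_b) / denom if denom else 0.0


doc_1 = ["fast", "car"]
doc_2 = ["quick", "automobile"]

print(bow_cosine(doc_1, doc_2))   # no shared words, so the similarity is zero
print(soft_cosine(doc_1, doc_2))  # related words now contribute (about 0.85)
```

With disjoint vocabularies the bag-of-words score is zero, while the combined score reflects both term frequencies and word relatedness, which is the general idea the abstract argues for.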

Journal: Knowledge and Information Systems
Publication Date: Jan 2, 2019
Citations: 15

Similar Papers

SISR: System for integrating semantic relatedness and similarity measures
Mohamed Ben Aouicha ... Mohamed Ali Hadj Taieb
Soft Computing | VOL. 22
21 Nov 2016

Using ontology for measuring semantic similarity for question answering system
Muthukrishnan Ramprasath ... Shanmugasundaram Hariharan
01 Aug 2012

Semantic textual similarity between sentences using bilingual word semantics
Md Shajalal ... Masaki Aono
Progress in Artificial Intelligence | VOL. 8
09 Mar 2019

Analysis of tweets to find the basis of popularity based on events semantic similarity
Rajat Kumar Mudgal ... Alfredo Milani
International Journal of Web Information Systems | VOL. 14
27 Nov 2018
