Clustering of semantically enriched short texts

Marek Kozlowski, Henryk Rybinski

doi:10.1007/s10844-018-0541-4

Open Access

Abstract

The paper is devoted to the issue of clustering small sets of very short texts. Such texts are often incomplete and highly inconclusive, so establishing a notion of proximity between them is a challenging task. To cope with polysemy, we adapt the SenseSearcher algorithm (SnS; Kozlowski and Rybinski, Computational Intelligence 33(3): 335–367, 2017b). In addition, we test whether the quality of clustering ultra-short texts can be improved by enriching them semantically. We present two approaches, one based on neural distributional models and the other on external knowledge resources. The approaches are tested on SnSRC and other knowledge-poor algorithms.

Highlights

Over the last decade, short text clustering has become an active research area in many tasks of Natural Language Processing (NLP).
As a first step of our experiments, we compared one general-purpose clustering algorithm (Bisecting k-means) and three text-oriented ones (STC, Lingo and SnSRC), all run on the company data without semantic enrichment.
Note that Bisecting k-means requires the number of resulting clusters to be defined a priori, whereas the other algorithms determine the number of clusters automatically.

Summary

Introduction

Short text clustering has become an active research area in many tasks of Natural Language Processing (NLP). The methods are based on representing text as a bag of words (BoW) and grouping texts on the basis of their lexical similarity. Most approaches focus on documents at least the size of a few tens of words (Ferragina and Scaiella 2012), a web snippet (Di Marco and Navigli 2013), or a paragraph (Shrestha et al. 2012). Di Marco and Navigli (2013) identify the ambiguity of texts as one of the main problems. Another issue, addressed by Pinto et al. (2007), is the clustering of short texts with narrow domain characteristics. The length of texts seems to be the main issue; the clustering of short
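The bag-of-words representation and lexical similarity mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the tokenizer, the use of cosine similarity, and the three example texts are assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    # Counter returns 0 for missing terms, so only shared terms contribute.
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical short texts, invented for illustration only.
texts = [
    "car repair shop",
    "automobile service garage",
    "car rental service",
]
vectors = [bag_of_words(t) for t in texts]

# Related meaning, but no shared words.
sim_synonyms = cosine_similarity(vectors[0], vectors[1])
# Different meaning, but one shared word ("car").
sim_shared_word = cosine_similarity(vectors[0], vectors[2])
print(sim_synonyms, sim_shared_word)
```

In this sketch the first pair scores 0 even though both texts describe a similar business, while the second pair scores about 0.33 because of a single shared word. With only a few words per text, purely lexical overlap is a weak signal of proximity.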

Journal: Journal of Intelligent Information Systems
Publication Date: Dec 18, 2018
Citations: 15
License type: open-access
