Abstract

The paper is devoted to the issue of clustering small sets of very short texts. Such texts are often incomplete and highly inconclusive, so establishing a notion of proximity between them is a challenging task. To cope with polysemy, we adapt the SenseSearcher algorithm (SnS) of Kozlowski and Rybinski (Computational Intelligence 33(3): 335–367, 2017b). In addition, we test whether the quality of clustering ultra-short texts can be improved by enriching them semantically. We present two approaches, one based on neural distributional models and the other on external knowledge resources. The approaches are tested on SnSRC and other knowledge-poor algorithms.

Highlights

  • Over the last decade, short text clustering has become an active research area in many tasks of Natural Language Processing (NLP)

  • As a first step of our experiments, we compared one general-purpose clustering algorithm (Bisecting k-means) with three text-oriented ones (STC, Lingo and SnSRC), all run on the company data without semantic enrichment

  • Let us note that Bisecting k-means requires defining a priori the number of resulting clusters, whereas the other algorithms automatically determine the number of clusters
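
The point about Bisecting k-means can be illustrated with a minimal sketch. It is not the implementation used in the paper: the whitespace tokenizer, the cosine similarity over raw term counts, the rule of always splitting the largest cluster, and the names `bow`, `cosine`, `two_means` and `bisecting_kmeans` are all assumptions made for illustration. What it does show is that the caller has to supply the number of clusters `k` up front.

```python
from collections import Counter
from math import sqrt


def bow(text):
    """Bag-of-words vector (assumed tokenizer: lowercase + whitespace split)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Sum of the vectors; cosine ignores scale, so no division is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total


def two_means(members, vectors, iters=10):
    """Split a cluster (list of indices, at least two) into two groups."""
    # Deterministic seeding: start from the least similar pair of members.
    pairs = [(a, b) for x, a in enumerate(members) for b in members[x + 1:]]
    s0, s1 = min(pairs, key=lambda p: cosine(vectors[p[0]], vectors[p[1]]))
    seeds = (vectors[s0], vectors[s1])
    left, right = [], []
    for _ in range(iters):
        new_left = [m for m in members
                    if cosine(vectors[m], seeds[0]) >= cosine(vectors[m], seeds[1])]
        new_right = [m for m in members if m not in new_left]
        if (new_left, new_right) == (left, right):
            break  # assignments stopped changing
        left, right = new_left, new_right
        if not left or not right:
            break  # degenerate split, e.g. all members identical
        seeds = (centroid(vectors[m] for m in left),
                 centroid(vectors[m] for m in right))
    return left, right


def bisecting_kmeans(texts, k):
    """Return up to k clusters of text indices; k must be chosen by the caller."""
    vectors = [bow(t) for t in texts]
    clusters = [list(range(len(texts)))]
    while len(clusters) < k:
        clusters.sort(key=len)
        biggest = clusters.pop()  # assumed rule: split the largest cluster
        if len(biggest) < 2:
            clusters.append(biggest)
            break
        left, right = two_means(biggest, vectors)
        if not left or not right:
            clusters.append(biggest)
            break
        clusters += [left, right]
    return clusters


# Hypothetical toy data: two senses of the ambiguous word "apple".
texts = [
    "apple fruit juice",
    "apple fruit pie",
    "apple iphone store",
    "apple iphone price",
]
clusters = bisecting_kmeans(texts, k=2)
print(clusters)
```

With `k=2` the toy texts separate into the two fruit texts and the two phone texts. Asking for `k=3` forces a further split of one of those groups whether or not the data supports it, which is the practical cost of fixing the number of clusters in advance.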

Summary

Introduction

Short text clustering has become an active research area in many tasks of Natural Language Processing (NLP). The methods are based on representing text as a bag-of-words (bow) and grouping texts on the basis of their lexical similarity. Most approaches focus on documents of at least a few tens of words (Ferragina and Scaiella 2012), a web snippet (Di Marco and Navigli 2013), or a paragraph (Shrestha et al. 2012). Di Marco and Navigli (2013) identify the ambiguity of texts as one of the main problems. Another issue, addressed by Pinto et al. (2007), is the difficulty of clustering short texts with narrow domain characteristics. The length of texts seems to be the main issue; the clustering of short
