Abstract

The paper addresses the problem of optimal text classification in the area of automated detection of text typology. In conventional approaches to topicality-based text classification (including topic modeling), the number of clusters must be set by the researcher, and both the optimal number of clusters and the quality of the model that designates the proximity of texts to each other remain unresolved questions. We propose a novel approach to the automated determination of the optimal number of clusters that also incorporates an assessment of word-level proximity between texts, combined with a text-encoding model based on sentence embeddings. Our approach combines Universal Sentence Encoder (USE) data pre-processing, agglomerative hierarchical clustering by Ward’s method, and the Markov stopping moment for optimal clustering. The preferred number of clusters is determined based on the “e-2” hypothesis. We conduct an experiment on two datasets of real-world labeled data: News20 and BBC. The proposed model is tested against more traditional text representation methods, such as bag-of-words and word2vec, and is shown to provide substantially better clustering quality than the baseline DBSCAN and OPTICS models with different encoding methods. We use three quality metrics to demonstrate that clustering quality does not drop as the number of clusters grows. Thus, we move closer to the convergence of text clustering and text classification.

Highlights

  • Text Classification vs. Text Clustering: Approaches to Automated Text Typology. The task of finding topical or sentiment-based clusters in real-world textual datasets poses a wide range of methodological problems, linked both to text clustering (finding the optimal number of clusters) and to text classification (assigning a class to a text)

  • Approaches as different as DBSCAN- and word2vec-based ones go “from classification to clustering”, first measuring word proximity and defining clusters on that basis. Both routes involve arbitrary parameters that must be defined by the researcher, such as the number of topics for LDA-based topic modelling or the two DBSCAN parameters that limit the number of possible clusters in advance

  • We propose a combination of Universal Sentence Encoder (USE) data pre-processing with sentence embeddings, agglomerative clustering by Ward’s method, and the Markov stopping moment


Introduction

Text Classification vs. Text Clustering: Approaches to Automated Text Typology

The task of finding topical or sentiment-based clusters in real-world textual datasets poses a wide range of methodological problems, linked both to text clustering (finding the optimal number of clusters) and to text classification (assigning a class to a text). LSA- and LDA-based topic modelling goes “from clustering to classification” by assigning texts to topic slots with probabilities based on probabilistically measured word proximity [2]. Approaches as different as DBSCAN- and word2vec-based ones go “from classification to clustering”, first measuring word proximity and defining clusters on that basis. Both routes involve arbitrary parameters that must be defined by the researcher, such as the number of topics for LDA-based topic modelling or the two DBSCAN parameters that limit the number of possible clusters in advance. DBSCAN-based approaches are widely used, but they are just as widely known for their low performance on data of high dimensionality, e.g., when the number of clusters exceeds 10; a myriad of extensions have been proposed to resolve this problem [6]
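The overall pipeline (sentence embeddings, agglomerative clustering with Ward’s linkage, and a data-driven stopping rule for the number of clusters) can be sketched as follows. This is a minimal illustrative sketch, not the authors’ implementation: random vectors stand in for USE embeddings, and a largest-gap heuristic over successive merge distances stands in for the paper’s Markov stopping moment and “e-2” hypothesis, which are not specified here.

```python
# Sketch of the clustering pipeline described above. Assumptions:
# - random 512-dim vectors stand in for USE sentence embeddings;
# - the cut rule (largest gap between successive Ward merge distances)
#   is an illustrative stand-in for the Markov stopping moment.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# 60 synthetic "documents": three well-separated topic groups
embeddings = np.vstack([
    rng.normal(loc=c, scale=0.05, size=(20, 512)) for c in (0.0, 1.0, 2.0)
])

# Agglomerative hierarchical clustering with Ward's linkage
Z = linkage(embeddings, method="ward")

# Stand-in stopping rule: cut the dendrogram at the largest gap
# between successive merge distances
merge_dists = Z[:, 2]
gaps = np.diff(merge_dists)
cut = merge_dists[np.argmax(gaps)] + gaps.max() / 2
labels = fcluster(Z, t=cut, criterion="distance")

print(len(set(labels)))  # recovers the 3 synthetic groups
```

In practice the embeddings would come from a pre-trained sentence encoder such as USE, and the stopping rule would be replaced by the paper’s Markov stopping moment; the point of the sketch is that the number of clusters is read off the merge history rather than fixed in advance.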

