Abstract

A massive amount of textual data now exists in digital repositories in the form of research articles, news articles, reviews, Wikipedia articles, and books. Text clustering is a fundamental data mining technique used for categorization, topic extraction, and information retrieval. Textual datasets, especially those containing a large number of documents, are sparse and high-dimensional. Hence, traditional clustering techniques such as K-means, agglomerative clustering, and DBSCAN cannot perform well on them. In this paper, a clustering technique especially suited to large text datasets, named WEClustering, is proposed to overcome these limitations. The proposed technique is based on word embeddings derived from a recent deep learning model named Bidirectional Encoder Representations from Transformers (BERT). It deals with the problem of high dimensionality in an effective manner, and hence forms more accurate clusters. The technique is validated on several datasets of varying sizes, and its performance is compared with other widely used and state-of-the-art clustering techniques. The experimental comparison shows that the proposed technique gives a significant improvement over the others as measured by metrics such as purity and the adjusted Rand index.

Highlights

  • Nowadays, a huge amount of textual data exists in digital form

  • The proposed text clustering technique named WEClustering gives a unique way of leveraging the word embeddings to perform text clustering

  • The time to form clusters directly depends on the number of dimensions used; the values listed in Table 7 for WEClustering are much lower than those of any other clustering technique

  • A novel and powerful document clustering technique named WEClustering is proposed that combines the advantages of statistical scoring mechanisms such as term frequency (TF)-inverse document frequency (IDF) and state-of-the-art word embeddings (see the sketch below)
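
To make the combination described in the last highlight concrete, here is a minimal sketch. It is an illustrative approximation, not the authors' exact pipeline: it averages BERT token embeddings weighted by TF-IDF scores to obtain dense document vectors and then clusters them with K-means. The model name, the weighting scheme, and the toy corpus are all assumptions.

```python
import numpy as np
import torch
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoModel, AutoTokenizer

# Toy corpus; a real run would involve thousands of documents.
docs = [
    "deep learning models produce contextual word embeddings",
    "bert embeddings capture the meaning of words in context",
    "kmeans clustering partitions sparse tfidf document vectors",
    "dbscan and agglomerative clustering are density and hierarchy based",
]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# TF-IDF gives each word a corpus-level importance weight.
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(docs)
vocab = tfidf.vocabulary_

doc_vectors = []
for i, doc in enumerate(docs):
    enc = tokenizer(doc, return_tensors="pt", truncation=True)
    with torch.no_grad():
        token_embs = model(**enc).last_hidden_state[0]  # (n_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    vecs, weights = [], []
    for tok, emb in zip(tokens, token_embs):
        col = vocab.get(tok)  # skips [CLS]/[SEP] and sub-word pieces
        if col is not None:
            vecs.append(emb.numpy())
            weights.append(tfidf_matrix[i, col])
    # TF-IDF-weighted average of token embeddings -> dense 768-dim vector
    doc_vectors.append(np.average(vecs, axis=0, weights=weights))

labels = KMeans(n_clusters=2, n_init=10).fit_predict(np.vstack(doc_vectors))
print(labels)  # e.g. [0 0 1 1]: embedding-themed docs vs. clustering-themed docs
```

The key design point is that every document ends up as a dense vector of fixed size (768 here) regardless of vocabulary size, which is what sidesteps the sparsity of a pure TF-IDF representation.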

Summary

Introduction

A huge amount of textual data exists in digital form. For example, millions of articles are published every year in thousands of English-language journals alone [20], and their number is continuously increasing. The matrix representation of such datasets becomes very sparse (containing a large number of zeros). Traditional clustering techniques, such as partitioning-based, hierarchical, and density-based methods, cannot perform well under these circumstances because the dimensionality they can handle is limited to a few hundred. The proposed text clustering technique, named WEClustering, gives a unique way of leveraging word embeddings to perform text clustering. This technique tackles one of the biggest problems of text mining, the curse of dimensionality, in its own way so as to give more efficient clustering of textual data, especially in the case of big textual datasets.
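
The sparsity claim is easy to verify empirically. The short sketch below (the corpus choice is an illustrative assumption) builds a TF-IDF matrix for a standard text dataset and reports its dimensionality and non-zero fraction:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# One category of the 20 Newsgroups corpus, used here only as an example.
docs = fetch_20newsgroups(subset="train", categories=["sci.space"]).data

X = TfidfVectorizer().fit_transform(docs)
print(X.shape)  # (n_docs, vocab_size): one dimension per vocabulary term

density = X.nnz / (X.shape[0] * X.shape[1])
print(f"non-zero fraction: {density:.4f}")  # typically well below 1%
```

With tens of thousands of dimensions and almost all entries equal to zero, distance computations become unreliable, which is the setting in which the traditional techniques above struggle.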

Related work and background
Conclusions and future scope