Abstract

A massive amount of textual data now exists in digital repositories in the form of research articles, news articles, reviews, Wikipedia articles, and books. Text clustering is a fundamental data mining technique used for categorization, topic extraction, and information retrieval. Textual datasets, especially those containing a large number of documents, are sparse and high-dimensional. Hence, traditional clustering techniques such as K-means, agglomerative clustering, and DBSCAN cannot perform well on them. In this paper, a clustering technique especially suited to large text datasets, named WEClustering, is proposed to overcome these limitations. The proposed technique is based on word embeddings derived from a recent deep learning model named Bidirectional Encoder Representations from Transformers (BERT). It deals with the problem of high dimensionality in an effective manner, and hence forms more accurate clusters. The technique is validated on several datasets of varying sizes, and its performance is compared with other widely used and state-of-the-art clustering techniques. The experimental comparison shows that the proposed technique gives a significant improvement over the others as measured by metrics such as purity and the adjusted Rand index.

Highlights

  • Nowadays, a huge amount of textual data exists in digital form

  • The proposed text clustering technique named WEClustering gives a unique way of leveraging the word embeddings to perform text clustering

  • The time to form clusters directly depends on the number of dimensions used; the values listed in Table 7 for WEClustering are much lower than those of any other clustering technique

  • A novel and powerful document clustering technique named WEClustering is proposed that combines the advantages of statistical scoring mechanisms such as term frequency (TF)-inverse document frequency (IDF) and state-of-the-art word embeddings (see the sketch below)
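
To make the combination described in the last highlight concrete, here is a minimal sketch. It is an illustrative approximation, not the authors' exact pipeline: it averages BERT token embeddings weighted by TF-IDF scores to obtain dense document vectors and then clusters them with K-means. The model name, the weighting scheme, and the toy corpus are all assumptions.

```python
import numpy as np
import torch
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoModel, AutoTokenizer

# Toy corpus; a real run would involve thousands of documents.
docs = [
    "deep learning models produce contextual word embeddings",
    "bert embeddings capture the meaning of words in context",
    "kmeans clustering partitions sparse tfidf document vectors",
    "dbscan and agglomerative clustering are density and hierarchy based",
]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# TF-IDF gives each word a corpus-level importance weight.
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(docs)
vocab = tfidf.vocabulary_

doc_vectors = []
for i, doc in enumerate(docs):
    enc = tokenizer(doc, return_tensors="pt", truncation=True)
    with torch.no_grad():
        token_embs = model(**enc).last_hidden_state[0]  # (n_tokens, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    vecs, weights = [], []
    for tok, emb in zip(tokens, token_embs):
        col = vocab.get(tok)  # skips [CLS]/[SEP] and sub-word pieces
        if col is not None:
            vecs.append(emb.numpy())
            weights.append(tfidf_matrix[i, col])
    # TF-IDF-weighted average of token embeddings -> dense 768-dim vector
    doc_vectors.append(np.average(vecs, axis=0, weights=weights))

labels = KMeans(n_clusters=2, n_init=10).fit_predict(np.vstack(doc_vectors))
print(labels)  # e.g. [0 0 1 1]: embedding-themed docs vs. clustering-themed docs
```

The key design point is that every document ends up as a dense vector of fixed size (768 here) regardless of vocabulary size, which is what sidesteps the sparsity of a pure TF-IDF representation.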

Summary

Introduction

A huge amount of textual data exists in digital form. For example, millions of articles are published every year in thousands of English-language journals alone [20], and their number is continuously increasing. The matrix representation of such datasets becomes very sparse (containing a large number of zeros). Traditional clustering techniques, such as partitioning-based, hierarchical, and density-based methods, cannot perform well under these circumstances because the dimensionality they can handle is limited to a few hundred. The proposed text clustering technique, named WEClustering, gives a unique way of leveraging word embeddings to perform text clustering. This technique tackles one of the biggest problems of text mining, the curse of dimensionality, in its own way so as to give more efficient clustering of textual data, especially in the case of big textual datasets.
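
The sparsity claim is easy to verify empirically. The short sketch below (the corpus choice is an illustrative assumption) builds a TF-IDF matrix for a standard text dataset and reports its dimensionality and non-zero fraction:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# One category of the 20 Newsgroups corpus, used here only as an example.
docs = fetch_20newsgroups(subset="train", categories=["sci.space"]).data

X = TfidfVectorizer().fit_transform(docs)
print(X.shape)  # (n_docs, vocab_size): one dimension per vocabulary term

density = X.nnz / (X.shape[0] * X.shape[1])
print(f"non-zero fraction: {density:.4f}")  # typically well below 1%
```

With tens of thousands of dimensions and almost all entries equal to zero, distance computations become unreliable, which is the setting in which the traditional techniques above struggle.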

Related work and background
Conclusions and future scope