A parallel text clustering method using Spark and hashing

Mohamed Aymen Ben Hajkacem, Nadia Essoussi, Chiheb-Eddine Ben N’Cir

doi:10.1007/s00607-021-00932-y

Abstract

Clustering textual data has become an important task in data analytics, since many applications require large collections of text documents to be organized automatically into homogeneous topics. The rapid growth of textual data available from the web, social networks and open platforms has made this task more challenging. It has therefore become important to design scalable clustering methods that can effectively organize huge amounts of textual data into topics. In this context, we propose a new parallel text clustering method based on the Spark framework and hashing. The proposed method addresses two issues simultaneously: the clustering of a huge number of documents, handled by a divide-and-conquer approach, and the high dimensionality of textual data, handled by a new document hashing strategy. Together, these two components yield an important improvement in scalability and a good approximation of clustering quality. Experiments performed on several large collections of documents show the effectiveness of the proposed method compared to existing ones in terms of running time and clustering accuracy.
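
The abstract does not give the details of the hashing strategy or of the Spark implementation, so the following is only a minimal sketch of the two general ideas it names, under stated assumptions. It uses the generic "hashing trick" (each token is hashed into one of a fixed number of buckets) to obtain low-dimensional document vectors, and a plain-Python map/reduce over partitions to stand in for the divide-and-conquer step. The function names (`hash_document`, `partial_sum`, `merged_mean`), the vector size `DIM`, and the toy documents are hypothetical and are not taken from the paper.

```python
import hashlib
from typing import List, Sequence, Tuple


def hash_document(text: str, dim: int = 16) -> List[float]:
    """Map a document to a fixed-length term-count vector (generic hashing trick).

    Assumption: whitespace tokenization and SHA-256 bucket selection; the
    paper's own hashing strategy may differ.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        vec[index] += 1.0
    return vec


def partial_sum(chunk: Sequence[List[float]], dim: int) -> Tuple[List[float], int]:
    """'Map' step: summarize one partition by its vector sum and its size."""
    total = [0.0] * dim
    for vec in chunk:
        for i, value in enumerate(vec):
            total[i] += value
    return total, len(chunk)


def merged_mean(partials: Sequence[Tuple[List[float], int]], dim: int) -> List[float]:
    """'Reduce' step: combine partition summaries into one global mean vector."""
    total = [0.0] * dim
    count = 0
    for vec_sum, size in partials:
        for i, value in enumerate(vec_sum):
            total[i] += value
        count += size
    return [value / count for value in total] if count else total


# Hypothetical toy corpus, used only to exercise the functions.
docs = [
    "spark makes clustering scalable",
    "hashing reduces dimensionality",
    "text clustering with spark",
    "documents grouped into topics",
]
DIM = 16
vectors = [hash_document(d, DIM) for d in docs]

# Split the vectors into partitions of two documents each, summarize each
# partition independently, then merge the summaries.
chunks = [vectors[i:i + 2] for i in range(0, len(vectors), 2)]
centroid = merged_mean([partial_sum(c, DIM) for c in chunks], DIM)
```

Every document ends up with a vector of length `DIM` regardless of vocabulary size, and the merged mean equals the mean computed over all vectors at once. This is the property that lets a centroid update be split across partitions, which is the kind of work a framework such as Spark would distribute across workers.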

Journal: Computing	Publication Date: Apr 7, 2021
Citations: 4
