Distributed Clustering of Text Collections

Juan Zamora, Hector Allende-Cid, Marcelo Mendoza

doi:10.1109/access.2019.2949455

Open Access

https://doi.org/10.1109/access.2019.2949455

Abstract

Current data processing tasks require efficient approaches capable of dealing with large databases. A promising strategy consists of distributing the data across several computers, each of which partially solves the problem at hand. These partial answers are then integrated to obtain a final solution. We introduce distributed shared nearest neighbors (D-SNN), a novel clustering algorithm that works with disjoint partitions of data. Our algorithm produces a global clustering solution that is competitive with centralized approaches. The algorithm works effectively with high-dimensional data, making it well suited to document clustering tasks. Experimental results over five data sets show that our proposal is competitive in terms of quality performance measures when compared to state-of-the-art methods.
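
The partition, solve locally, then integrate pattern described in the abstract can be illustrated with a toy sketch. This is not the D-SNN algorithm: a greedy "leader" clustering over Jaccard similarity stands in for both the local and the global step, and the names `leader_clustering`, `distributed_clustering`, and the `threshold` parameter are assumptions made only for this illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leader_clustering(docs, threshold):
    """Greedy single-pass clustering (a stand-in, not SNN): each document
    joins the first leader it resembles closely enough, otherwise it
    becomes a new leader."""
    leaders, labels = [], []
    for doc in docs:
        for idx, leader in enumerate(leaders):
            if jaccard(doc, leader) >= threshold:
                labels.append(idx)
                break
        else:
            leaders.append(doc)
            labels.append(len(leaders) - 1)
    return leaders, labels


def distributed_clustering(docs, n_parts, threshold):
    """Cluster documents through disjoint partitions (hypothetical sketch)."""
    # 1. Disjoint round-robin partitions, one per hypothetical worker.
    parts = [list(range(i, len(docs), n_parts)) for i in range(n_parts)]
    # 2. Each worker clusters only its own partition and reports its leaders.
    local = []
    for ids in parts:
        leaders, labels = leader_clustering([docs[i] for i in ids], threshold)
        local.append((ids, leaders, labels))
    # 3. The coordinator clusters the pooled leaders to obtain global labels.
    pooled = [leader for _, leaders, _ in local for leader in leaders]
    _, pooled_labels = leader_clustering(pooled, threshold)
    # 4. Every document inherits the global label of its local leader.
    result = [None] * len(docs)
    offset = 0
    for ids, leaders, labels in local:
        for doc_id, local_label in zip(ids, labels):
            result[doc_id] = pooled_labels[offset + local_label]
        offset += len(leaders)
    return result
```

In this sketch only the local leaders travel to the coordinator, so the full collection never has to sit on a single node. The quality of the global labels depends on how well those leaders summarize each partition.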

Highlights

As a consequence of the explosive growth of the web, the integration of search engines into personal computers and mobile devices, and the extensive use of social networks, the clustering of text for document organization has become a crucial aspect of web data management.
The results show that centralized shared nearest neighbors (C-SNN) and distributed shared nearest neighbors (D-SNN) are sensitive to parameter tuning.
To assess the utility of D-SNN, especially for recovering the cluster structure underlying each collection, its performance was compared against two centralized approaches (C-SNN and Graph Clust).

Summary

INTRODUCTION

As a consequence of the explosive growth of the web, the integration of search engines into personal computers and mobile devices, and the extensive use of social networks, the clustering of text for document organization has become a crucial aspect of web data management. Ravichandran et al. [36] introduce a modification of SNN to work with high-dimensional data. It deals with the hubness problem, i.e., how to discard the effect of highly connected points in density estimation. We study how to provide a data-driven approach for parameter tuning. By addressing all these problems, we show that it is possible to provide a distributed clustering algorithm to deal with text collections. The main strength of the proposed algorithm is its ability to use disjoint data partitions to return high-quality clustering results. This strength avoids centralizing data partitions in a single data node to provide a clustering overview, reducing the computing load for clustering distributed data partitions.
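
A minimal sketch may help make the two notions above concrete: shared-nearest-neighbor similarity and hubness. It assumes the common textbook definition in which the SNN similarity of two points is the number of neighbors their k-nearest-neighbor lists have in common, with cosine similarity used to rank neighbors. The variant used in the paper or in [36] may differ, and all function names here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_lists(vectors, k):
    """For each vector, the set of indices of its k most similar other vectors."""
    neighbors = []
    for i, u in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        ranked = sorted(others, key=lambda j: cosine(u, vectors[j]), reverse=True)
        neighbors.append(set(ranked[:k]))
    return neighbors


def snn_similarity(neighbors, i, j):
    """Number of neighbors that points i and j share (assumed SNN definition)."""
    return len(neighbors[i] & neighbors[j])


def k_occurrence(neighbors):
    """How many k-NN lists each point appears in; high counts indicate hubs."""
    counts = [0] * len(neighbors)
    for nbrs in neighbors:
        for j in nbrs:
            counts[j] += 1
    return counts
```

A point with an unusually high k-occurrence count appears in many neighbor lists at once. It can therefore inflate the shared-neighbor count between otherwise unrelated points, which is the effect the hubness discussion above is concerned with.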

DISTRIBUTED CLUSTERING ALGORITHMS

Result

COST OF THE ALGORITHM

METHODOLOGY AND EXPERIMENTAL RESULTS

EMPIRICAL ASSESSMENT OF THE COMPUTATIONAL COST

Findings

CONCLUSION AND FUTURE WORK

Journal: IEEE Access	Publication Date: Jan 1, 2019
Citations: 2	License type: CC BY 4.0

Similar Papers

A Distributed Shared Nearest Neighbors Clustering Algorithm
Juan Zamora ... Marcelo Mendoza
01 Jan 2018

Approaches for scaling DBSCAN algorithm to large spatial databases
Aoying Zhou ... Yunfa Hu
Journal of Computer Science and Technology | VOL. 15
01 Nov 2000

Interpolation-based k-means Clustering Improvement for Sparse, High Dimensional Data
Wanghu Chen ... Zhen Tian
28 Aug 2019

Constraint Based Subspace Clustering for High Dimensional Uncertain Data
Xianchao Zhang ... Hong Yu
01 Jan 2015
