Abstract

Current data processing tasks require efficient approaches capable of dealing with large databases. A promising strategy consists of distributing the data across several computers, each of which solves part of the problem; these partial answers are then integrated to obtain a final solution. We introduce distributed shared nearest neighbors (D-SNN), a novel clustering algorithm that works with disjoint partitions of data. Our algorithm produces a global clustering solution whose performance is competitive with that of centralized approaches. The algorithm works effectively with high-dimensional data, which makes it well suited to document clustering tasks. Experimental results on five data sets show that our proposal is competitive with state-of-the-art methods in terms of clustering quality measures.

Highlights

  • As a consequence of the explosive growth of the web, the integration of search engines into personal computers and mobile devices, and the extensive use of social networks, the clustering of text for document organization has become a crucial aspect for web data management

  • The results show that both the centralized variant (C-SNN) and distributed shared nearest neighbors (D-SNN) are sensitive to parameter tuning

  • To assess the utility of D-SNN, especially for recovering the cluster structure underlying each collection, its performance was compared against two centralized approaches (C-SNN and Graph Clust)

Summary

INTRODUCTION

As a consequence of the explosive growth of the web, the integration of search engines into personal computers and mobile devices, and the extensive use of social networks, the clustering of text for document organization has become a crucial aspect of web data management. Ravichandran et al. [36] introduce a modification of SNN to work with high-dimensional data. It deals with the hubness problem, i.e., how to discard the effect of highly connected points in density estimation. We study how to provide a data-driven approach for parameter tuning. By addressing all these problems, we show that it is possible to provide a distributed clustering algorithm to deal with text collections. The main strength of the proposed algorithm is its ability to use disjoint data partitions to return high-quality clustering results. This avoids centralizing the data partitions in a single data node to obtain a clustering overview, reducing the computing load for clustering distributed data partitions.
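To make the notions of shared nearest neighbors and hubness concrete, the following is a minimal sketch of the generic SNN idea, not the D-SNN algorithm itself nor the modification of [36]. It assumes documents are given as dense term-weight vectors compared with cosine similarity; the function names (`knn_lists`, `snn_similarity`, `k_occurrence`) and the parameter `k` are illustrative choices. The SNN similarity of two documents is taken here as the number of neighbors their k-nearest-neighbor lists have in common, and the k-occurrence count (how often a document appears in other documents' lists) is one common way to spot highly connected points.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_lists(vectors, k):
    """For each vector, return the set of indices of its k most similar other vectors."""
    neighbors = []
    for i, u in enumerate(vectors):
        scored = [(cosine_similarity(u, v), j)
                  for j, v in enumerate(vectors) if j != i]
        # Sort by decreasing similarity; break ties by index for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        neighbors.append({j for _, j in scored[:k]})
    return neighbors


def snn_similarity(neighbors, i, j):
    """Number of neighbors shared by the k-NN lists of points i and j."""
    return len(neighbors[i] & neighbors[j])


def k_occurrence(neighbors):
    """How many k-NN lists each point appears in; large values suggest hubs."""
    counts = [0] * len(neighbors)
    for nbrs in neighbors:
        for j in nbrs:
            counts[j] += 1
    return counts


# Toy collection (assumed data): two groups of three near-parallel vectors.
docs = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.8, 0.2, 0.0],
    [0.0, 0.0, 1.0], [0.0, 0.1, 0.9], [0.0, 0.2, 0.8],
]
nbrs = knn_lists(docs, k=2)
print(snn_similarity(nbrs, 0, 1))  # documents from the same group
print(snn_similarity(nbrs, 0, 3))  # documents from different groups
print(k_occurrence(nbrs))
```

In this toy collection, documents from the same group share a neighbor while documents from different groups share none, which is the signal an SNN-based method builds on. In a distributed setting each node would hold only its own partition of `docs`; how the partial results are merged is specific to D-SNN and is not shown here.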

DISTRIBUTED CLUSTERING ALGORITHMS
Result
COST OF THE ALGORITHM
METHODOLOGY AND EXPERIMENTAL RESULTS
EMPIRICAL ASSESSMENT OF THE COMPUTATIONAL COST
Findings
CONCLUSION AND FUTURE WORK
