A Robust Distributed Clustering of Large Data Sets on a Grid of Commodity Machines

Salah Taamneh, Alaa E Abdallah, Hani Bani-Salameh, Mo’Taz Al-Hami

doi:10.3390/data6070073

Journal: Data
Publication Date: Jul 7, 2021
Citations: 1
License type: CC BY 4.0

Affiliation: Hashemite University

Abstract

Distributed clustering algorithms have proven to be effective in dramatically reducing execution time. However, distributed environments are characterized by a high rate of failure. Nodes can easily become unreachable. Furthermore, it is not guaranteed that messages are delivered to their destination. As a result, fault tolerance mechanisms are of paramount importance to achieve resiliency and guarantee continuous progress. In this paper, a fault-tolerant distributed k-means algorithm is proposed on a grid of commodity machines. Machines in such an environment are connected in a peer-to-peer fashion and managed by a gossip protocol with the actor model used as the concurrency model. The fact that no synchronization is needed makes it a good fit for parallel processing. Using the passive replication technique for the leader node and the active replication technique for the workers, the system exhibited robustness against failures. The results showed that the distributed k-means algorithm with no fault-tolerant mechanisms achieved up to a 34% improvement over the Hadoop-based k-means algorithm, while the robust one achieved up to a 12% improvement. The experiments also showed that the overhead, using such techniques, was negligible. Moreover, the results indicated that losing up to 10% of the messages had no real impact on the overall performance.
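
As a rough illustration of why a k-means iteration can tolerate some lost messages, the following minimal sketch simulates one leader/worker round in a single process. It is not the authors' implementation: the actor model, the gossip protocol, and the replication techniques are not shown, and the names `assign_and_sum`, `merge`, `kmeans_round`, and `loss_rate` are hypothetical. It assumes that each worker reports per-cluster sums and counts for its own shard, and that the leader averages whatever reports arrive.

```python
import random

def assign_and_sum(points, centroids):
    """Worker step (assumed): assign each local point to its nearest centroid
    and return per-cluster coordinate sums and counts."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        nearest = min(
            range(k),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
        )
        counts[nearest] += 1
        for i, x in enumerate(p):
            sums[nearest][i] += x
    return sums, counts

def merge(partials, centroids):
    """Leader step (assumed): combine the partial results that arrived.
    A cluster that received no points keeps its previous centroid."""
    k, dim = len(centroids), len(centroids[0])
    total_sums = [[0.0] * dim for _ in range(k)]
    total_counts = [0] * k
    for sums, counts in partials:
        for j in range(k):
            total_counts[j] += counts[j]
            for i in range(dim):
                total_sums[j][i] += sums[j][i]
    return [
        [s / total_counts[j] for s in total_sums[j]] if total_counts[j]
        else list(centroids[j])
        for j in range(k)
    ]

def kmeans_round(shards, centroids, loss_rate=0.0, rng=random):
    """One iteration: each shard's report is dropped with probability
    loss_rate to mimic an undelivered message."""
    delivered = [
        assign_and_sum(shard, centroids)
        for shard in shards
        if rng.random() >= loss_rate
    ]
    return merge(delivered, centroids)
```

Under these assumptions, a dropped report only removes one shard's contribution from that round's averages; the remaining reports still yield usable centroids, and the next round can include the missing shard again.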

Highlights

Commodity machines are everywhere and often available in large numbers
Iterative algorithms are considered a good fit for Single-Instruction Multiple-Data (SIMD) parallel processing [4]
We propose a robust distributed k-means algorithm on a cluster of commodity machines connected in a peer-to-peer fashion

Summary

Introduction

Commodity machines are everywhere and often available in large numbers. Such machines, if put together, represent tremendous computing power. A grid of commodity machines consists of autonomous machines connected via a LAN. These machines interact with each other to solve computational problems that no single machine could solve on its own. Distributed systems with loosely coupled machines are not considered the best choice for fine-grained parallel programs, as the latency caused by frequent communication over the network would significantly degrade overall performance. Such systems are mainly used for running coarse-grained parallel programs, where the communication/computation ratio is low [2]. Programs with coarse-grained parallelism are characterized by their low communication and synchronization overhead.
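
To make the communication/computation ratio concrete, the sketch below applies a deliberately simple cost model to one k-means iteration on one worker. The model is an assumption for illustration and is not taken from the paper: the worker sends k partial centroid sums of length dim plus k counts, and computes one squared-difference term per point, centroid, and coordinate. The function name `comm_to_comp_ratio` and its parameters are hypothetical.

```python
def comm_to_comp_ratio(n_local, k, dim):
    """Values sent per iteration divided by distance terms computed per
    iteration for one worker, under the assumed cost model."""
    values_sent = k * (dim + 1)        # k sums of length dim, plus k counts
    distance_terms = n_local * k * dim  # one term per point, centroid, coordinate
    return values_sent / distance_terms
```

For example, `comm_to_comp_ratio(1000, 4, 2)` evaluates 12 / 8000. Under this model the ratio shrinks in proportion to 1 / n_local, so larger shards per worker push the program toward the coarse-grained regime described above.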