Abstract

The Hadoop distributed file system (HDFS) is responsible for storing very large data-sets reliably on clusters of commodity machines. HDFS takes advantage of replication to serve data requested by clients with high throughput. Data replication is a trade-off between better data availability and higher disk usage. Recent studies propose data replication management frameworks that alter the replication factor of files dynamically in response to the popularity of the data, keeping more replicas for in-demand data to enhance the overall performance of the system. When data becomes less popular, these schemes reduce the replication factor, which changes the data distribution and leads to an unbalanced data distribution. Such an unbalanced data distribution causes hot spots, low data locality and excessive network usage in the cluster. In this work, we first confirm that reducing the replication factor causes unbalanced data distribution when using Hadoop’s default replica deletion scheme. Then, we show that even keeping a balanced data distribution using WBRD (data-distribution-aware replica deletion scheme), which we proposed in previous work, performs sub-optimally on heterogeneous clusters. To overcome this issue, we propose a heterogeneity-aware replica deletion scheme (HaRD). HaRD considers the nodes’ processing capabilities when deleting replicas; hence it stores more replicas on the more powerful nodes. We implemented HaRD on top of HDFS and conducted a performance evaluation on a 23-node dedicated heterogeneous cluster. Our results show that HaRD reduced execution time by up to 60% and 17% compared to Hadoop and WBRD, respectively.
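The core idea behind HaRD, choosing which replicas to delete based on node processing capability, can be illustrated with a small sketch. The classes, compute scores and selection rule below are hypothetical and are not HaRD's actual implementation: given the nodes that currently hold a block and a lower target replication factor, replicas are dropped first from nodes with the lowest compute score, so the surviving copies concentrate on the more powerful machines.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/**
 * Illustrative sketch of heterogeneity-aware replica deletion
 * (hypothetical classes, not the authors' HDFS implementation).
 */
public class HeterogeneityAwareDeletion {

    /** Hypothetical view of a DataNode: name, compute score, blocks already stored. */
    record Node(String name, double computeScore, long blocksStored) {}

    /**
     * Returns the nodes whose replica of a block should be removed so that
     * exactly targetReplication replicas remain.
     */
    static List<Node> replicasToDelete(List<Node> holders, int targetReplication) {
        int toRemove = holders.size() - targetReplication;
        if (toRemove <= 0) return List.of();

        // Rank holders: delete first from nodes with a low compute score; among
        // similarly scored nodes, prefer deleting from those storing more blocks.
        List<Node> ranked = new ArrayList<>(holders);
        ranked.sort(Comparator
                .comparingDouble(Node::computeScore)                              // weakest first
                .thenComparing(Comparator.comparingLong(Node::blocksStored).reversed()));
        return ranked.subList(0, toRemove);
    }

    public static void main(String[] args) {
        List<Node> holders = List.of(
                new Node("dn-fast-1", 8.0, 1200),
                new Node("dn-fast-2", 8.0, 900),
                new Node("dn-slow-1", 2.0, 700),
                new Node("dn-slow-2", 2.0, 650));

        // Popularity dropped: shrink from 4 replicas to 2.
        System.out.println(replicasToDelete(holders, 2));
        // -> the two low-score ("slow") nodes lose their replicas, so the
        //    remaining copies sit on the more powerful nodes.
    }
}
```

The sketch only conveys the weighting intuition; HaRD itself is implemented on top of HDFS and applies this kind of decision inside the replica deletion path, as described in the abstract.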

Highlights

  • In recent years, the number of data sources (e.g., IoT devices and social media applications) has been increasing exponentially, and data is incessantly produced every second

  • Improvements become more compelling when the system is highly utilised by a large number of concurrent requests, reaching 60% and 17% compared to the Hadoop distributed file system (HDFS) and workload-aware balanced replica deletion (WBRD), respectively

  • We extend the formal definition of the replica deletion problem to heterogeneous clusters



Introduction

The number of data sources is increasing exponentially (e.g., IoT devices and social media applications), and data is incessantly produced every second. Processing large data-sets in order to extract meaningful information has become vital for business success and has created the demand for large-scale distributed data-intensive systems [1,2,3]. HDFS [6] is one of the four core modules of the Hadoop Project [4] and is responsible for storing data in a distributed fashion. HDFS is highly scalable and capable of storing tremendous data-sets on a large number of commodity machines. On such a scale, node failures are more than a theoretical probability and can occur for various reasons, e.g., hardware failures or power losses. HDFS follows a master/slave architecture: the NameNode (NN), the master, manages the file system metadata and directs data requests to the relevant DataNodes (DNs), while each DataNode, a slave, is responsible for storing blocks and serving them in response to data requests. The number of DNs can scale to thousands and can store tens of petabytes [6].
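For context on how dynamic replication management interacts with this architecture: the standard HDFS client API exposes FileSystem.setReplication, which a replication-management framework can call when a file's popularity rises or falls. The path and replication factors below are illustrative; which DataNodes gain or lose the affected replicas is decided by the NameNode, and it is exactly this deletion decision that schemes such as WBRD and HaRD refine.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Minimal sketch of adjusting a file's replication factor through the
 * standard HDFS client API. File path and factors are hypothetical.
 */
public class ReplicationAdjuster {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        try (FileSystem fs = FileSystem.get(conf)) {
            Path hotFile = new Path("/data/popular-dataset.parquet");  // hypothetical path

            // Data became popular: keep more replicas for higher read throughput.
            fs.setReplication(hotFile, (short) 5);

            // Popularity faded: shrink back; the NameNode schedules the
            // over-replicated copies for deletion on DataNodes it selects.
            fs.setReplication(hotFile, (short) 2);
        }
    }
}
```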

