Improved Hadoop Cluster Performance by Dynamic Load and Resource Aware Speculative Execution and Straggler Node Detection

Juby Mathew, Thomas Scaria, Terry Jacob Mathew

doi:10.35940/ijeat.d8017.049420

Abstract

Big data is one of the fastest growing technologies, handling huge amounts of data from various sources such as social media, web logs, and the banking and business sectors. To keep pace with changes in data patterns and to accommodate the requirements of big data analytics, storage and processing platforms such as Hadoop also require significant advances. Hadoop, an open source project, executes big data processing jobs in map and reduce phases and follows a master-slave architecture. A Hadoop MapReduce job can be delayed if one of its many tasks is assigned to an unreliable or congested machine. To solve this straggler problem, a novel design of speculative execution schemes for parallel processing clusters is proposed from an optimization perspective under different loading conditions. For the lightly loaded case, a task cloning scheme, namely the combined file task cloning algorithm, is proposed; it is based on maximizing the overall system utility. A straggler detection algorithm based on a workload threshold is also proposed. Detecting stragglers and cloning only the tasks assigned to them is not enough to enhance performance unless the clones are allocated in a resource-aware manner. Therefore, a method is proposed that identifies and optimizes resource allocation by considering all possible aspects of cluster performance balancing. One main issue arises from the pre-configuration of distinct map and reduce slots based on the number of files in the input folder. This can cause severe under-utilization of slots, because map slots might not be fully utilized with respect to the input splits. To solve this issue, an alternative Hadoop slot allocation technique is introduced in this paper, which maintains an efficient slot management model. The combined file task cloning algorithm combines files that are smaller than a single data block and executes them on the highest-performing data node.
When these cloning and combining techniques are applied to a heavily loaded cluster after the straggler is detected, the elapsed execution time is found to be reduced to an average of 40%. The detection algorithm improves the overall performance of the heavily loaded cluster by 20% of the total elapsed time in comparison with the native Hadoop algorithm.
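
As a rough illustration of the two ideas named in the abstract (combining files smaller than one data block, and flagging stragglers with a workload threshold), the following Python sketch may help. It is not the authors' implementation: the function names, the 128 MB block size, the next-fit packing rule, and the "1.5 times the median load" threshold are all assumptions made for this example.

```python
from statistics import median

# Assumed HDFS block size; the paper's actual setting is not given in the abstract.
BLOCK_SIZE = 128 * 1024 * 1024


def combine_small_files(file_sizes, block_size=BLOCK_SIZE):
    """Pack files smaller than one block into combined splits (hypothetical helper).

    file_sizes: dict mapping file name -> size in bytes.
    Returns a list of splits (lists of file names). Files of one block or
    more each get their own split; smaller files are packed so that a
    combined split never exceeds block_size.
    """
    splits, current, current_size = [], [], 0
    # Visit files from largest to smallest so big files are handled first.
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1], reverse=True):
        if size >= block_size:
            splits.append([name])
            continue
        if current_size + size > block_size:
            # The open split is full: close it and start a new one.
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits


def detect_stragglers(node_load, threshold=1.5):
    """Flag nodes whose load exceeds threshold x the median load (assumed rule)."""
    baseline = median(node_load.values())
    return sorted(n for n, load in node_load.items() if load > threshold * baseline)


def best_node(node_load, exclude=()):
    """Pick the least loaded node that is not excluded, as a target for a clone."""
    candidates = {n: load for n, load in node_load.items() if n not in exclude}
    return min(candidates, key=candidates.get)


MB = 1024 * 1024
files = {"a": 100 * MB, "b": 60 * MB, "c": 50 * MB, "d": 200 * MB, "e": 10 * MB}
loads = {"n1": 0.2, "n2": 0.3, "n3": 0.9, "n4": 0.25}

splits = combine_small_files(files)
stragglers = detect_stragglers(loads)
target = best_node(loads, exclude=stragglers)
```

In this toy run, the three small files `b`, `c`, and `e` end up in one combined split, node `n3` is flagged as a straggler, and `n1` is chosen as the node on which a cloned task would run. A real scheduler would also need to account for the resource-aware allocation the abstract describes, which this sketch leaves out.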

Highlights

Hadoop differs from traditional distributed systems in its core execution strategy of data locality.
This means that Hadoop's mode of existence and execution differs from the data warehouses and relational databases used for data analytics in the past.
[2] This paper proposes a new dynamic method of implementation known as Maximum Cost Performance (MCP).

Summary

INTRODUCTION

The phenomenal data explosion posed a challenge to technological firms such as Google, Yahoo, Amazon, and Microsoft. These companies had to sift through massive amounts of data to find customer orientations and preferences related to books, adverts, and trending websites. Traditional data handling tools failed in this regard. Google introduced the revolutionary MapReduce system, which can handle big data processing. Hadoop differs from traditional distributed systems in its core execution strategy of data locality. This means that Hadoop's mode of existence and execution differs from the data warehouses and relational databases used for data analytics in the past.
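
To make the idea of data locality more concrete, here is a minimal Python sketch of a scheduler that prefers to run a task on a node that already stores a replica of its input block, and falls back to a remote node only when no such node has a free slot. This is a simplified illustration, not Hadoop's actual scheduler: the function name `assign_tasks`, the slot model, and the greedy tie-breaking are assumptions.

```python
def assign_tasks(block_locations, free_slots):
    """Assign each input block to a node, preferring nodes that hold a replica.

    block_locations: dict mapping block id -> list of nodes storing a replica.
    free_slots: dict mapping node -> number of free task slots (not mutated).
    Returns a dict mapping block id -> (node, is_local).
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    for block, replicas in block_locations.items():
        # Nodes that hold the data and still have a free slot.
        local = [n for n in replicas if slots.get(n, 0) > 0]
        if local:
            node = max(local, key=lambda n: slots[n])
        else:
            # No local slot available: the data must be read over the network.
            remote = [n for n, s in slots.items() if s > 0]
            if not remote:
                raise RuntimeError("no free slots left")
            node = max(remote, key=lambda n: slots[n])
        slots[node] -= 1
        assignment[block] = (node, bool(local))
    return assignment


block_locations = {"b1": ["n1", "n2"], "b2": ["n1"], "b3": ["n1"]}
free_slots = {"n1": 2, "n2": 1, "n3": 1}
assignment = assign_tasks(block_locations, free_slots)
```

Here blocks `b1` and `b2` run locally on `n1`, while `b3` has to run remotely on `n2` because `n1` has no slots left. Moving the computation to the data in this way is what distinguishes the approach from systems that ship data to a central processing node.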

MapReduce and speculative execution

Objective

Problem definition

Scope of the Work

Expected Outcome

LITERATURE SURVEY

Literature Summary

SYSTEM MODEL

Job service process under speculative execution

Problem formulation

IMPLEMENTATION

Eclipse

Oracle VM VirtualBox

CONCLUSION


Journal: International Journal of Engineering and Advanced Technology
Publication Date: Apr 30, 2020
License type: cc-by

