Abstract

Big data is one of the fastest-growing technologies, capable of handling huge volumes of data from various sources such as social media, web logs, and the banking and business sectors. To keep pace with changing data patterns and to accommodate the requirements of big data analytics, storage and processing platforms such as Hadoop also require significant advancement. Hadoop, an open-source project, executes big data processing jobs in map and reduce phases and follows a master-slave architecture. A Hadoop MapReduce job can be delayed if one of its many tasks is assigned to an unreliable or congested machine. To address this straggler problem, a novel design of speculative execution schemes for parallel processing clusters is proposed from an optimization perspective, under different loading conditions. For the lightly loaded case, a task cloning scheme, namely the combined file task cloning algorithm, is proposed based on maximizing the overall system utility; for the heavily loaded case, a straggler detection algorithm based on a workload threshold is proposed. Detecting stragglers and cloning only the tasks assigned to them is not enough to enhance performance unless the cloned tasks are allocated in a resource-aware manner. Therefore, a method is proposed that identifies and optimizes resource allocation by considering all relevant aspects of cluster performance balancing. A further issue arises from the preconfiguration of distinct map and reduce slots based on the number of files in the input folder, which can cause severe slot under-utilization when map slots are not fully occupied with respect to the input splits. To solve this, an alternative Hadoop slot allocation technique is introduced while retaining the efficient slot management model. The combined file task cloning algorithm merges files smaller than a single data block and executes them on the best-performing data node. Implementing these cloning and combining techniques on a heavily loaded cluster, after detecting the straggler machine, reduces the elapsed execution time by an average of 40%. The detection algorithm improves the overall performance of the heavily loaded cluster by 20% of the total elapsed time in comparison with the native Hadoop algorithm.
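To make the workload-threshold idea concrete, the following minimal Java sketch flags tasks whose projected completion time exceeds a configurable multiple of the cluster average, marking them as candidates for speculative cloning. This is an illustration only, not the paper's actual implementation: the names (TaskStats, findStragglers) and the 1.5x threshold factor are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class StragglerDetector {
    // Hypothetical per-task progress sample.
    static class TaskStats {
        final String taskId;
        final double progress;    // fraction of work completed, 0.0..1.0
        final double elapsedSecs; // time spent so far

        TaskStats(String taskId, double progress, double elapsedSecs) {
            this.taskId = taskId;
            this.progress = progress;
            this.elapsedSecs = elapsedSecs;
        }

        // Projected total time if the task keeps its current rate;
        // a task with no progress is treated as never finishing.
        double estimatedTotalSecs() {
            return progress > 0 ? elapsedSecs / progress : Double.MAX_VALUE;
        }
    }

    // Flag tasks whose projected total time exceeds the cluster average
    // by a threshold factor (assumed value, e.g. 1.5x).
    static List<TaskStats> findStragglers(List<TaskStats> tasks, double thresholdFactor) {
        double avg = tasks.stream()
                .filter(t -> t.progress > 0)
                .mapToDouble(TaskStats::estimatedTotalSecs)
                .average()
                .orElse(0.0);
        List<TaskStats> stragglers = new ArrayList<>();
        for (TaskStats t : tasks) {
            if (t.estimatedTotalSecs() > thresholdFactor * avg) {
                stragglers.add(t); // candidate for speculative cloning
            }
        }
        return stragglers;
    }

    public static void main(String[] args) {
        List<TaskStats> tasks = List.of(
                new TaskStats("task_001", 0.90, 45),
                new TaskStats("task_002", 0.85, 48),
                new TaskStats("task_003", 0.20, 50)); // slow task
        findStragglers(tasks, 1.5)
                .forEach(t -> System.out.println("Straggler: " + t.taskId));
    }
}
```

In this toy run, task_003 projects to 250 s against a cluster average of roughly 119 s, so it alone is flagged; a resource-aware scheduler would then clone it onto a well-performing node rather than onto an arbitrary one.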

Highlights

  • Unlike traditional distributed systems, Hadoop's core execution strategy is built on data locality

  • This means that Hadoop operates and executes differently from the data warehouses and relational databases previously used for data analytics

  • A dynamic speculative execution method known as Maximum Cost Performance (MCP) has been proposed [2]


Summary

INTRODUCTION

The handling of the phenomenal data explosion posed a challenge to technological firms such as Google, Yahoo, Amazon, and Microsoft. These companies had to sift through massive amounts of data to identify customer orientations and preferences related to books, advertisements, and trending websites. Traditional data-handling tools failed in this regard, so Google introduced the revolutionary MapReduce system for big data processing. Unlike traditional distributed systems, Hadoop's core execution strategy is built on data locality: computation is moved to the node that holds the data rather than moving the data across the network. This means that Hadoop operates and executes differently from the data warehouses and relational databases previously used for data analytics.
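As a point of reference for the map and reduce phases discussed above, the canonical word-count job is reproduced below in condensed form against the standard Hadoop MapReduce API. The framework schedules each map task on, or near, the DataNode holding its input split, which is the data-locality strategy in action; this example illustrates the programming model generally, not this paper's specific workload.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs on the node holding the input split (data locality),
    // emitting a (word, 1) pair for every token in the line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sums the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```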

MapReduce and speculative execution
Objective
Problem definition
Scope of the Work
Expected Outcome
LITERATURE SURVEY
Literature Summary
SYSTEM MODEL
Job service process under speculative execution
Problem formulation
IMPLEMENTATION
Eclipse
Oracle VM VirtualBox
CONCLUSION
