Abstract

Hadoop MapReduce detects and recovers from faults reactively, after they occur, using static heartbeat detection and re-execution from scratch. These techniques lead to excessive response time penalties and inefficient resource consumption during detection and recovery. Existing fault-tolerance solutions aim to mitigate these limitations without considering critical conditions such as fail-slow faults, the impact of faults at different infrastructure levels, and the relationship between the detection and recovery stages. This paper analyses the response time under two main fault conditions, fail-stop and fail-slow, when they manifest at the node, service, and task levels at runtime. In addition, we focus on the relationship between the time taken to detect faults and the time taken to recover from them. The experimental analysis is conducted on a real Hadoop cluster comprising the MapReduce, YARN and HDFS frameworks. Our analysis shows that the recovery of a single fault leads to an average response time penalty of 67.6%. Even when the detection and recovery times are well tuned, data locality and resource availability must also be considered to obtain the optimum tolerance time and the lowest penalties.
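
For context, the detection timeouts the abstract refers to are governed by standard Hadoop, YARN and HDFS configuration properties. The sketch below is illustrative only: the property names assume a recent Hadoop 3.x distribution, and the concrete values are hypothetical examples rather than the settings used in the paper's experiments.

    import org.apache.hadoop.conf.Configuration;

    // Illustrative only: stock Hadoop/YARN/HDFS properties that control how
    // quickly node, task and DataNode faults are detected. The values below
    // are example assumptions, not the paper's experimental settings.
    public class DetectionTimeouts {
        public static void main(String[] args) {
            Configuration conf = new Configuration();

            // The ResourceManager declares a NodeManager dead after this
            // expiry interval without heartbeats (default: 600000 ms).
            conf.setLong("yarn.nm.liveness-monitor.expiry-interval-ms", 120_000L);

            // A MapReduce task that reports no progress for this long is
            // killed and re-scheduled (default: 600000 ms).
            conf.setLong("mapreduce.task.timeout", 120_000L);

            // The DataNode heartbeat interval (seconds) and the NameNode
            // recheck interval (ms) together bound DataNode failure detection.
            conf.setLong("dfs.heartbeat.interval", 3L);
            conf.setLong("dfs.namenode.heartbeat.recheck-interval", 60_000L);

            System.out.println("Task timeout: "
                    + conf.getLong("mapreduce.task.timeout", 600_000L) + " ms");
        }
    }

Smaller values such as these shorten the detection stage, but, as the results below indicate, they also increase resource consumption because faults are re-checked and work is re-executed more aggressively.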

Highlights

  • MapReduce is the most popular data processing model [1], used for Big Data-related applications and services over the cloud

  • We confirmed experimentally that the response time penalty is due to slow fault detection and the resources wasted by fault recovery in Hadoop's default fault-tolerance mechanism

  • Service fail-stop incurs the highest response time penalties compared to the other fault types

  • When the size of a task's workload increases due to a large data block, the recovery time increases because the entire block must be re-computed

  • Fault occurrence late in the job lifetime incurs higher penalties for node and service fail-stop, and lower penalties for task fail-stop and fail-slow

  • The current fault-tolerance method does not consider the programming logic of the application when detecting and recovering from faults and failures

  • The response time decreases when small timeout values are set, but at the cost of higher resource consumption

  • The recovery of a single fault leads to an average response time penalty of 67.6% (a back-of-envelope sketch of this figure follows the list)
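
To make the last two highlights concrete, the following back-of-envelope sketch models the response time under a single fault as the fault-free time plus detection and recovery overheads. The decomposition and the example numbers are illustrative assumptions; only the 67.6% average penalty is a figure reported by the paper.

    // Back-of-envelope model (an illustrative assumption, not the paper's
    // measured model): response time under a single fault is the fault-free
    // time plus the time to detect the fault plus the time to recover it.
    public class PenaltySketch {
        // Detection waits for the configured timeout to expire; recovery
        // re-executes the affected work from scratch.
        static double faultyResponseTime(double faultFree, double detection, double recovery) {
            return faultFree + detection + recovery;
        }

        public static void main(String[] args) {
            double faultFree = 600.0;   // hypothetical 10-minute job (seconds)
            double detection = 120.0;   // e.g. a 2-minute liveness timeout
            double recovery  = 300.0;   // re-running the lost tasks/block

            double faulty = faultyResponseTime(faultFree, detection, recovery);
            double penalty = (faulty - faultFree) / faultFree * 100.0;
            // With these example numbers the penalty is 70%, in the same range
            // as the 67.6% average the paper reports for a single fault.
            System.out.printf("Response time penalty: %.1f%%%n", penalty);
        }
    }

The model also makes the trade-off visible: shrinking the detection timeout reduces the detection term, while larger data blocks inflate the recovery term because the whole block is re-computed.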

Introduction

MapReduce is the most popular data processing model [1], used for Big Data-related applications and services over the cloud. Owing to its flexibility, organisations such as Yahoo, Google and Facebook use Hadoop MapReduce to manage their data-intensive computations in large-scale computing environments. Hadoop MapReduce supports the implementation of complex algorithms that require high computation power in a distributed manner, such as anomaly analysis, network intrusion detection, and calculating network centrality [2,3,4]. In such environments, faults in a node, service or task are common, and they significantly degrade system performance if fault tolerance is not handled properly.
