Abstract

Hadoop has become a critical component in today's cloud environment. Ensuring good performance for Hadoop is paramount for the wide range of applications built on top of it. In this paper we analyze Hadoop's behavior under failures involving compute nodes. We find that even a single failure can result in inflated, variable and unpredictable job running times, all undesirable properties in a distributed system. We systematically track the causes underlying this distressing behavior. First, we find that Hadoop makes unrealistic assumptions about task progress rates. These assumptions can be easily invalidated by the cloud environment and, more surprisingly, by Hadoop's own design decisions. The result is significant inefficiencies in Hadoop's speculative execution algorithm. Second, failures are re-discovered individually by each task at the cost of great degradation in job running time. The reason is that Hadoop focuses on extreme scalability and thus trades off possible improvements resulting from sharing failure information between tasks. Third, Hadoop does not consider the causes of connection failures between its tasks. We show that the resulting overloading of connection failure semantics unnecessarily causes an otherwise localized failure to propagate to healthy tasks. We also discuss the implications of our findings and draw attention to new ways of improving Hadoop-like frameworks.
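To make the progress-rate assumption concrete, the following is a minimal Java sketch, not Hadoop's actual scheduler code, of the kind of speculation logic the abstract refers to: each task's remaining time is extrapolated from its observed progress rate, and tasks estimated to finish far behind the average become candidates for speculative backup copies. Class and method names here are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical illustration of progress-rate-based speculation.
 * Hadoop's real scheduler is far more involved; this sketch only captures
 * the core assumption that past progress rate predicts remaining time.
 */
public class SpeculationSketch {

    /** Minimal view of a running task: fraction done and elapsed time. */
    static class TaskStatus {
        final String id;
        final double progress;      // 0.0 .. 1.0
        final double elapsedSecs;   // time since the task started

        TaskStatus(String id, double progress, double elapsedSecs) {
            this.id = id;
            this.progress = progress;
            this.elapsedSecs = elapsedSecs;
        }

        /** Progress per second, assumed (often unrealistically) constant. */
        double progressRate() {
            return elapsedSecs > 0 ? progress / elapsedSecs : 0.0;
        }

        /** Estimated seconds left under the constant-rate assumption. */
        double estimatedTimeLeft() {
            double rate = progressRate();
            return rate > 0 ? (1.0 - progress) / rate : Double.POSITIVE_INFINITY;
        }
    }

    /**
     * Pick tasks whose estimated remaining time exceeds the mean estimate
     * by the given factor; these become speculation candidates.
     */
    static List<TaskStatus> chooseSpeculationCandidates(List<TaskStatus> tasks,
                                                        double slowdownFactor) {
        double meanLeft = tasks.stream()
                .mapToDouble(TaskStatus::estimatedTimeLeft)
                .filter(Double::isFinite)
                .average()
                .orElse(0.0);
        List<TaskStatus> candidates = new ArrayList<>();
        for (TaskStatus t : tasks) {
            if (t.estimatedTimeLeft() > slowdownFactor * meanLeft) {
                candidates.add(t);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<TaskStatus> tasks = List.of(
                new TaskStatus("map_0001", 0.90, 60),   // healthy task
                new TaskStatus("map_0002", 0.85, 65),   // healthy task
                new TaskStatus("map_0003", 0.10, 60));  // stalled, e.g. behind a failed node
        for (TaskStatus t : chooseSpeculationCandidates(tasks, 1.5)) {
            System.out.println("Speculate on " + t.id);
        }
    }
}
```

The paper's point is that the constant-rate assumption baked into logic like `estimatedTimeLeft()` breaks down under node failures and environmental variation, which is what leads to the speculative-execution inefficiencies described above.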
