Over the past decade, Apache Hadoop has become a leading framework for big data processing. Single board computer (SBC) clusters, predominantly adopting Raspberry Pi (RPi), have been employed to explore the potential of MapReduce processing in terms of low power and cost because, capital costs aside, power consumption has also become a primary concern in many industries. After building SBC clusters, it is prevalent to consider adding more nodes, particularly newer generation SBCs, to the existing clusters or replacing old (or inactive) nodes with new ones to improve performance, inevitably causing heterogeneous SBC clusters. The Hadoop framework on these heterogeneous SBC clusters creates challenging new problems due to computing resource discrepancies in each node. Native Hadoop does not carefully consider the heterogeneity of the cluster nodes. Consequently, heterogeneous SBC Hadoop clusters result in significant performance variation or, more critically, persistent node failures. This paper proposes a new Hadoop Yet Another Resource Negotiator (YARN) architecture design to improve MapReduce performance on heterogeneous SBC Hadoop clusters with tight computing resources. We newly implement two main scheduling policies on Hadoop YARN based on the correct computing resource information that each SBC node provides: (1) two (master-driven vs. slave-driven) MapReduce task scheduling frameworks to determine more effective processing modes and (2) ApplicationMaster (AM) and reduce task distribution mechanisms to provide the best Hadoop performance by minimizing performance variation. Thus, the proposed Hadoop framework makes the best use of the performance-frugal SBC Hadoop cluster by intelligently distributing MapReduce tasks to each node. To our knowledge, the proposed framework is the first redesigned Hadoop YARN architecture to address various challenging problems particularly on heterogeneous SBC Hadoop clusters for big data processing. The extensive experiments with Hadoop benchmarks demonstrate that the redesigned framework performs better performance than the native Hadoop by an average of 2.55× and 1.55× under I/O intensive and CPU-intensive workloads, respectively.
Read full abstract