Abstract

The increase in the number of users and the volume of digital content has led to an increase in the size and energy consumption of data centers. Efficient use of energy is essential to address the concerns of cost and sustainability. Many data centers contain MapReduce clusters of hundreds or thousands of machines to efficiently process infrequent batch and interactive Big Data workloads. Nevertheless, a large number of these machines are underutilized for long periods, which makes MapReduce clusters energy inefficient. In this thesis we focus on improving the energy efficiency of MapReduce clusters to reduce the energy consumption of data centers. MapReduce frameworks automate the execution of data-parallel tasks on a distributed cluster of commodity nodes, as well as the replication of data and tasks for reliability. While such a design provides high scalability, fault tolerance and an easy programming interface, it poses several challenges to the use of common resource consolidation methods for improving energy efficiency. For instance, consolidating the workload on fewer nodes has a negative impact on the performance and availability of the system. Popular dynamic voltage and frequency scaling mechanisms, which consider only CPU utilization, may not be optimal for IO-intensive MapReduce data processing systems. Likewise, tuning the MapReduce configuration parameters for energy efficiency is not simple, for three reasons: the number of configuration parameters is large; a parameter can have conflicting impacts on performance and energy; and the parameters are not necessarily orthogonal, that is, changing the value of one parameter can alter the impact of other parameters. In this thesis, we use statistical and empirical methods to address the challenges of configuring the parameters to improve the energy efficiency of MapReduce systems without impacting their performance, fault tolerance and scalability.
We first characterize the energy efficiency of MapReduce workloads with respect to the built-in CPU governors to determine the most effective power settings. Next, we use factorial design of experiments to study the effects of configuration parameters on performance and energy consumption, with a view to identifying the most influential ones efficiently. We then perform a detailed performance and energy characterization for the critical parameters and derive the respective empirical models using linear regression. We analyze the energy and performance models of a variety of MapReduce workloads to understand the relative impact of CPU frequency and other critical parameters. We further present a MapReduce Configurator, which employs the performance and energy models to tune the critical parameters for energy efficiency. We perform the characterizations and evaluations on multiple real clusters, each consisting of a MapReduce platform (e.g., Hadoop-1 and Yarn) deployed on specific hardware (e.g., nodes with Intel Pentium G-2020 and Intel E5-2450 processors), with benchmark applications ranging from micro-level benchmarks (e.g., wordcount and sort) to macro-level machine learning applications (e.g., Kmeans and Pagerank). Using the MapReduce Configurator, we achieve an improvement of approximately 20-100% in the energy efficiency of typical MapReduce workloads in two architecturally different clusters. We demonstrate that tuning just the CPU-frequency setting improves the energy efficiency of machine learning workloads by an average of 25% over the default CPU-governor setting. Through extensive empirical evaluations, we establish the generality and effectiveness of our MapReduce Configurator and models. We also observe that the energy-aware configuration determined using the MapReduce Configurator reduces the energy consumption of MapReduce clusters without impacting their performance. This helps in reducing the operational costs of data centers.
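
The factorial screening step mentioned above can be illustrated with a minimal sketch. It is not the thesis's implementation: the factor names (`cpu_freq`, `map_slots`, `io_sort_mb`) and the response function `synthetic_energy` are hypothetical placeholders standing in for real measurements on a cluster. The sketch runs a two-level full factorial design and estimates each factor's main effect as the difference between the mean response at its high level and at its low level.

```python
from itertools import product


def main_effects(factors, response):
    """Estimate main effects from a two-level full factorial design.

    factors  -- list of factor names; each takes coded levels -1 and +1
    response -- callable mapping {factor: level} to a measured value
    """
    runs = list(product((-1, 1), repeat=len(factors)))  # all 2^k combinations
    results = [response(dict(zip(factors, levels))) for levels in runs]
    effects = {}
    for i, name in enumerate(factors):
        high = [y for levels, y in zip(runs, results) if levels[i] == 1]
        low = [y for levels, y in zip(runs, results) if levels[i] == -1]
        # Main effect: mean response at the high level minus mean at the low level.
        effects[name] = sum(high) / len(high) - sum(low) / len(low)
    return effects


def synthetic_energy(setting):
    """Hypothetical stand-in for a measured energy value (not real data)."""
    a, b = setting["cpu_freq"], setting["map_slots"]
    # io_sort_mb deliberately has no influence in this toy response.
    return 100 - 10 * a + 4 * b + 2 * a * b


# Hypothetical factor names, chosen only for illustration.
FACTORS = ["cpu_freq", "map_slots", "io_sort_mb"]
effects = main_effects(FACTORS, synthetic_energy)
# Rank factors by the magnitude of their effect, most influential first.
ranked = sorted(effects, key=lambda name: abs(effects[name]), reverse=True)
print(ranked, effects)
```

In practice the response would come from timed and metered job runs, and the highest-ranked factors would be the candidates for the detailed characterization and regression modeling described in the abstract.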
