MapReduce Applications Research Articles

The majority of large-scale data-intensive applications executed by data centers are based on MapReduce or its open-source implementation, Hadoop. Such applications are executed on large clusters requiring large amounts of energy, making the energy costs a considerable fraction of the data center's overall costs. Therefore, minimizing the energy consumption when executing each MapReduce job is a critical concern for data centers. In this paper, we propose a framework for improving the energy efficiency of MapReduce applications while satisfying the service level agreement (SLA). We first model the problem of energy-aware scheduling of a single MapReduce job as an Integer Program. We then propose two heuristic algorithms, called energy-aware MapReduce scheduling algorithms (EMRSA-I and EMRSA-II), that find the assignments of map and reduce tasks to the machine slots in order to minimize the energy consumed when executing the application. We perform extensive experiments on a Hadoop cluster to determine the energy consumption and execution time for several workloads from the HiBench benchmark suite, including TeraSort, PageRank, and K-means clustering, and then use this data in an extensive simulation study to analyze the performance of the proposed algorithms. The results show that EMRSA-I and EMRSA-II are able to find near-optimal job schedules consuming approximately 40 percent less energy on average than the schedules obtained by a common-practice scheduler that minimizes the makespan.
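To make the idea of assigning tasks to slots under a deadline more concrete, here is a minimal Python sketch of a generic greedy heuristic. It is not EMRSA-I or EMRSA-II, whose details are not given in the abstract. It assumes identical tasks, one task at a time per slot, and known per-slot energy and time costs. The names `Slot`, `greedy_energy_schedule`, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    name: str
    energy_per_task: float  # joules per task (hypothetical value)
    time_per_task: float    # seconds per task (hypothetical value)


def greedy_energy_schedule(num_tasks, slots, deadline):
    """Fill the cheapest-energy slots first without exceeding the deadline.

    Illustrative assumption: tasks are identical and each slot runs its
    tasks one after another. Returns (assignment, total_energy), or None
    if the tasks cannot all finish by the deadline.
    """
    assignment = {s.name: 0 for s in slots}
    remaining = num_tasks
    for slot in sorted(slots, key=lambda s: s.energy_per_task):
        # Number of sequential tasks this slot can finish before the deadline.
        capacity = int(deadline // slot.time_per_task)
        take = min(capacity, remaining)
        assignment[slot.name] = take
        remaining -= take
        if remaining == 0:
            break
    if remaining > 0:
        return None  # the deadline (SLA) cannot be met with these slots
    total_energy = sum(assignment[s.name] * s.energy_per_task for s in slots)
    return assignment, total_energy


slots = [Slot("A", energy_per_task=5, time_per_task=10),
         Slot("B", energy_per_task=8, time_per_task=5)]
result = greedy_energy_schedule(5, slots, deadline=20)
```

With these made-up values, slot A can fit two tasks before the deadline and the remaining three go to the more energy-hungry slot B. A makespan-minimizing scheduler would instead favor the faster slot B regardless of its energy cost.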


Motivated by the explosion of Big Data analytics, performance improvements in low-power (wimpy) systems and the increasing energy efficiency of GPUs, this paper presents a time–energy performance analysis of MapReduce on heterogeneous systems with GPUs. We evaluate the time and energy performance of three MapReduce applications with diverse resource demands on a Hadoop–CUDA framework. As executing these applications on heterogeneous systems with GPUs is challenging, we introduce a novel lazy processing technique which requires no modifications to the underlying Hadoop framework. To analyze the impact of heterogeneity, we compare the heterogeneous CPU+GPU with the homogeneous CPU-only execution across three systems with diverse characteristics: (i) a traditional high-performance (brawny) Intel i7 system hosting a discrete 640-core Nvidia GPU of the latest Maxwell generation, (ii) a wimpy platform consisting of a quad-core ARM Cortex-A9 hosting the same discrete Maxwell GPU, and (iii) a wimpy platform integrating four ARM Cortex-A15 cores and 192 Nvidia Kepler GPU cores on the same chip. These systems encompass both intra-node heterogeneity with discrete GPUs and intra-chip heterogeneity with integrated GPUs. Our measurement-based performance analysis highlights the following results. For compute-intensive workloads, the brawny heterogeneous system achieves speedups of up to 2.3 and reduces the energy usage by almost half compared to the brawny homogeneous system. As expected, for applications where data transfers dominate the execution time, heterogeneity exhibits worse time–energy performance compared to homogeneous systems. For such applications, the heterogeneous wimpy A9 system with discrete GPU uses around 14 times the energy of the homogeneous A9 system due to both system resource imbalances and the high power overhead of the discrete GPU. However, comparing among heterogeneous systems, the wimpy A15 with integrated GPU uses the lowest energy across all workloads. This allows us to establish an execution time equivalence ratio between a single brawny node and multiple wimpy nodes. Based on this equivalence ratio, the wimpy nodes exhibit energy savings of two-thirds while maintaining the same execution time. This result advocates the potential usage of heterogeneous wimpy systems with integrated GPUs for Big Data analytics.
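The arithmetic behind an equivalence-ratio comparison can be sketched as follows. The abstract reports only the two-thirds saving, not the ratio or the node power figures, so the function name `energy_savings` and every number below are hypothetical, chosen only to illustrate how a saving of that size could arise. The sketch assumes that energy is average power multiplied by execution time, and that the group of wimpy nodes finishes in the same time as the single brawny node.

```python
def energy_savings(brawny_power_w, wimpy_power_w, equivalence_ratio):
    """Fractional energy saved by replacing one brawny node with
    `equivalence_ratio` wimpy nodes that finish in the same time.

    With equal execution time T, energy is power * T, so T cancels:
        savings = 1 - (ratio * wimpy_power) / brawny_power
    """
    return 1 - (equivalence_ratio * wimpy_power_w) / brawny_power_w


# Hypothetical figures: a 90 W brawny node versus three 10 W wimpy nodes.
savings = energy_savings(brawny_power_w=90, wimpy_power_w=10, equivalence_ratio=3)
```

Under these made-up figures the wimpy group draws one third of the brawny node's power over the same duration, which corresponds to a two-thirds energy saving.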



Articles published on MapReduce Applications

Flame-MR: An event-driven architecture for MapReduce applications

Cloudflow - enabling faster biomedical pipelines with MapReduce and Spark

Enabling fast failure recovery in shared Hadoop clusters: Towards failure-aware scheduling

An FPGA-based Integrated MapReduce Accelerator Platform

SLA-aware energy-efficient scheduling scheme for Hadoop YARN

Energy Efficient Cloud Service Provisioning: Keeping Data Center Granularity in Perspective

IntegrityMR: Exploring Result Integrity Assurance Solutions for Big Data Computing Applications

De-Identified Personal Health Care System Using Hadoop

Modeling the Performance of MapReduce Applications for the Cloud

Quality of Service Aware Reliable Task Scheduling in Vehicular Cloud Computing

Data-locality-aware mapreduce real-time scheduling framework

Coding Productivity in MapReduce Applications for Distributed and Shared Memory Architectures

BSP cost and scalability analysis for MapReduce operations

DMR: A Deterministic MapReduce for Multicore Systems

Energy-Aware Scheduling of MapReduce Jobs for Big Data Applications

BIGhybrid: a simulator for MapReduce applications in hybrid distributed infrastructures validated with the Grid5000 experimental platform

A MapReduce scratchpad memory for multi-core cloud computing applications

LIBRA: Lightweight Data Skew Mitigation in MapReduce

Spatial Locality Aware Disk Scheduling in Virtualized Environment

A time–energy performance analysis of MapReduce on heterogeneous systems with GPUs

