MapReduce Cluster Research Articles

As MapReduce becomes ubiquitous in large-scale data analysis, many recent studies have shown that MapReduce performance can be improved by different job scheduling approaches, e.g., Fair Scheduler and Capacity Scheduler. However, most existing MapReduce job schedulers assume that the cluster is stable and pay little attention to clusters with dynamic resource availability. In fact, MapReduce cluster resources may fluctuate, as a growing number of Hadoop clusters are deployed on hybrid systems, e.g., infrastructure powered by a mix of traditional and renewable energy, and cloud platforms hosting heterogeneous workloads. Thus, there is a growing need to provide predictable services to users who have strict requirements on job completion times in such dynamic environments. In this paper, we propose RDS, a Resource and Deadline-aware Hadoop job Scheduler that takes future resource availability into consideration when minimizing job deadline misses. We formulate the job scheduling problem as an online optimization problem and solve it using an efficient receding horizon control algorithm. To aid the control, we design a self-learning model to estimate job completion times. We further extend the design of the RDS scheduler to support flexible performance goals in various dynamic clusters. In particular, we use flexible deadline time bounds instead of a single fixed job completion deadline. We have implemented RDS in the open-source Hadoop implementation and performed evaluations with various benchmark workloads. Experimental results show that RDS substantially reduces the penalty of deadline misses, by at least 36 and 10 percent compared with Fair Scheduler and the Earliest Deadline First (EDF) scheduler, respectively. In a Hadoop cluster running partially on renewable energy, the experimental results show that the green-power-based resource prediction approach can further reduce the penalty of deadline misses by 16 percent compared with the Auto-Regressive Integrated Moving Average (ARIMA) prediction approach.
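The receding horizon idea in the abstract above can be illustrated with a minimal sketch. This is not the RDS algorithm: the `Job` fields, the slot-step work unit, the admission heuristic, and the assumption of a perfect resource forecast are all hypothetical choices made for illustration. At every step the scheduler plans over a short forecast window, applies only the first step of that plan, and then re-plans.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    remaining: int  # remaining work in slot-steps (hypothetical unit)
    deadline: int   # absolute step; work must be done in steps < deadline


def plan_first_step(jobs, now, forecast):
    """Plan over the forecast window and return only the allocation for `now`.

    forecast[k] is the predicted number of slots at step now + k.
    """
    # Admission pass in earliest-deadline order: keep a job only if the
    # predicted capacity before its deadline covers it plus earlier commitments.
    admitted, deferred, committed = [], [], 0
    for job in sorted(jobs, key=lambda j: j.deadline):
        steps_left = max(0, min(job.deadline - now, len(forecast)))
        capacity = sum(forecast[:steps_left])
        if committed + job.remaining <= capacity:
            admitted.append(job)
            committed += job.remaining
        else:
            deferred.append(job)  # predicted to miss; served with leftovers

    # Allocate the current step only: admitted jobs first, then deferred ones.
    slots = forecast[0] if forecast else 0
    allocation = {}
    for job in admitted + deferred:
        give = min(job.remaining, slots)
        if give > 0:
            allocation[job.name] = give
            slots -= give
    return allocation


def run(jobs, availability, horizon=3):
    """Re-plan at every step, applying only the first step of each plan."""
    finish = {}
    for now in range(len(availability)):
        active = [j for j in jobs if j.remaining > 0]
        forecast = availability[now:now + horizon]  # assumes a perfect forecast
        for name, give in plan_first_step(active, now, forecast).items():
            job = next(j for j in active if j.name == name)
            job.remaining -= give
            if job.remaining == 0:
                finish[name] = now + 1  # step at which the job completes
    return finish


jobs = [Job("A", remaining=5, deadline=2), Job("B", remaining=3, deadline=3)]
finish = run(jobs, availability=[2, 1, 1, 4, 4], horizon=3)
```

In this toy run, job A cannot meet its deadline under the forecast, so the planner serves B first and B finishes by step 2, within its deadline of 3; A finishes at step 4. A plain earliest-deadline-first allocation on the same input would spend the early slots on A and miss both deadlines, which is the kind of case where looking ahead at resource availability helps.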

Cloud computing has become a compelling paradigm built on compute and storage virtualization technologies. Current virtualization solutions in the Cloud rely widely on hypervisor-based technologies. Given the recent boom of the container ecosystem, container-based virtualization has started receiving more attention as a promising alternative. Although container technologies are generally considered to be lightweight, no virtualization solution is resource-free, and the corresponding performance overheads have a negative impact on the quality of Cloud services. To facilitate understanding of container technologies from a performance engineering perspective, we conducted a two-stage performance investigation into Docker containers as a concrete example. In the first stage, we used a physical machine with “just-enough” resources as a baseline to investigate the performance overhead of a standalone Docker container against a standalone virtual machine (VM). With findings contrary to the related work, our evaluation results show that the performance overhead of virtualization can vary not only on a feature-by-feature basis but also on a job-to-job basis. Moreover, the hypervisor-based technology does not come with higher performance overhead in every case. For example, Docker containers exhibit lower QoS in terms of storage transaction speed. In the ongoing second stage, we employed a physical machine with “fair-enough” resources to implement a container-based MapReduce application and tried to optimize its performance. In fact, this machine could not support VM-based MapReduce clusters at the same scale. The performance tuning results show that the effects of different optimization strategies can depend largely on the data characteristics. For example, LZO compression brought the most significant performance improvement when dealing with text data in our case.
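The dependence of compression benefits on data characteristics, noted at the end of the abstract above, can be illustrated with a small sketch. It uses Python's standard `zlib` module as a stand-in, since LZO is not in the standard library, and it measures only compression ratio, not MapReduce job time; the sample inputs are made up for illustration.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data)) / len(data)


# Repetitive text stands in for log-like or textual input data.
text = b"the quick brown fox jumps over the lazy dog\n" * 2000
# Random bytes stand in for data that is already compressed or encrypted.
noise = os.urandom(len(text))

text_ratio = compression_ratio(text)
noise_ratio = compression_ratio(noise)
print(f"text: {text_ratio:.3f}  random: {noise_ratio:.3f}")
```

Highly redundant text shrinks to a small fraction of its size, while random bytes barely shrink at all, so any saving in disk or network I/O from compressing intermediate data would be expected to differ accordingly.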

Articles published on MapReduce Cluster

YARN Schedulers for Hadoop MapReduce Jobs: Design Goals, Issues and Taxonomy

A Parallel Fractional Lion Algorithm for Data Clustering Based on MapReduce Cluster Framework

Enhancing Leakage Prevention for MapReduce

RETRACTED ARTICLE: Urban ecological environment investigation based on a cloud computing platform and optimization of computer neural network algorithm

Effective Scheduler for Distributed DNN Training Based on MapReduce and GPU Cluster

Towards Greening MapReduce Clusters Considering Both Computation Energy and Cooling Energy

Evaluating the Effects of Modern Storage Devices on the Efficiency of Parallel Machine Learning Algorithms

Distributed Approach to Process Satellite Image Edge Detection on Hadoop Using Artificial Bee Colony

Designing a MapReduce performance model in distributed heterogeneous platforms based on benchmarking approach

Computationally Efficient Simulation of Queues: The R Package queuecomputer

Energy aware task scheduling using hybrid firefly - GA in big data

A Comprehensive View of Scheduling Algorithms for MapReduce Framework in Hadoop

Deadline-Aware MapReduce Job Scheduling with Dynamic Resource Availability

Spark for Social Science

A Predictive Map Task Scheduler for Optimizing Data Locality in MapReduce Clusters

A MapReduce implementation of posterior probability clustering and relevance models for recommendation

Prediction-Based and Locality-Aware Task Scheduling for Parallelizing Video Transcoding Over Heterogeneous MapReduce Cluster

Two-Stage Performance Engineering of Container-based Virtualization

The optimization for recurring queries in big data analysis system with MapReduce
