MapReduce Jobs Research Articles

“More data, more information.” Big data helps businesses and research communities gain insights and increase productivity. Many public cloud service providers offer Hadoop MapReduce as a pay-per-use service, delivered via infrastructure as a service on clusters of virtual machines that promise on-demand horizontal scaling. These clusters of virtual machines are launched on various physical machines across racks in cloud data centers. Such multi-tenancy introduces performance heterogeneity among Hadoop virtual machines because of hardware heterogeneity and interference from co-located virtual machines. Performance heterogeneity largely affects MapReduce job latency and the resource utilization of rented Hadoop virtual clusters. Default MapReduce schedulers assign map/reduce tasks assuming the hardware is homogeneous. Interference-aware schedulers observe only the interference pattern generated by co-located virtual machines; they do not consider the heterogeneous performance of the virtual machines themselves. Therefore, we propose a dynamic ranking-based MapReduce job scheduler that places map and reduce tasks based on a virtual machine’s performance rank to minimize job latency and improve resource utilization. Our proposed approach calculates a performance score for each virtual machine based on hardware heterogeneity and co-located virtual machine interference. Then, it ranks the virtual machines by map performance and by reduce performance separately to place map and reduce tasks. To demonstrate our ideas, we set up a test bed with 29 virtual machines on eight physical machines with different configurations and capacities. We modify the default fair scheduler in Hadoop 2.x to incorporate our ideas and evaluate them with different workloads on the PUMA dataset. The proposed method is then compared against the default fair scheduler (resource-aware) and an interference-aware scheduler in terms of job latency and resource utilization. Finally, we argue in favor of our approach, as it improves resource utilization by 30–65% and overall job latency by up to 30%.
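The ranking idea in the abstract above can be illustrated with a minimal Python sketch. The abstract does not give the scoring formula, so everything here is an assumption made for illustration: the `VM` fields, the score (baseline throughput discounted by an interference fraction), and the greedy slot-filling in `place_tasks` are hypothetical and are not the authors' method.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    map_throughput: float     # hypothetical: baseline speed on map-style work
    reduce_throughput: float  # hypothetical: baseline speed on reduce-style work
    interference: float       # hypothetical: fraction (0..1) lost to co-located VMs


def score(vm, phase):
    """Assumed score: baseline throughput discounted by interference."""
    base = vm.map_throughput if phase == "map" else vm.reduce_throughput
    return base * (1.0 - vm.interference)


def rank_vms(vms, phase):
    """Rank VMs for one phase, best first. Map and reduce are ranked separately."""
    return sorted(vms, key=lambda vm: score(vm, phase), reverse=True)


def place_tasks(tasks, ranked_vms, slots_per_vm):
    """Greedily fill the slots of the best-ranked VMs first."""
    placement = {}
    pending = iter(tasks)
    for vm in ranked_vms:
        for _ in range(slots_per_vm):
            task = next(pending, None)
            if task is None:
                return placement
            placement[task] = vm.name
    return placement


vms = [
    VM("vm-a", map_throughput=120, reduce_throughput=50, interference=0.25),
    VM("vm-b", map_throughput=80, reduce_throughput=90, interference=0.10),
    VM("vm-c", map_throughput=60, reduce_throughput=100, interference=0.00),
]
map_rank = rank_vms(vms, "map")
reduce_rank = rank_vms(vms, "reduce")
map_placement = place_tasks(["m1", "m2", "m3"], map_rank, slots_per_vm=2)
```

In this toy data the same VM can sit at the top of one ranking and the bottom of the other, which is the reason for keeping two rankings instead of a single per-VM score.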

Analytic queries are typically compiled into execution plans in the form of directed acyclic graphs (DAGs) of MapReduce jobs. Jobs in the DAGs are dispatched to the MapReduce processing engine as soon as their dependencies are satisfied. MapReduce adopts a job-level scheduling policy to strive for a balanced distribution of tasks and effective utilization of resources. However, such a simplistic policy is unable to reconcile the dynamics of different jobs in complex analytic queries, resulting in unfair treatment of different queries, low utilization of system resources, prolonged execution time, and low query throughput. Therefore, we introduce a scheduling framework to address these problems systematically. Our framework includes two techniques: multivariate DAG modeling and two-level query scheduling. Cross-layer semantics percolation allows query semantics and job dependencies in the DAG to flow to the MapReduce scheduler. With richer semantic information, we build a multivariate model that can accurately predict the execution time of individual MapReduce jobs and gauge the changing size of analytics datasets through selectivity approximation. Furthermore, we introduce two-level query scheduling that can maximize intra-query job-level concurrency and, at the same time, speed up query-level completion time based on the accurate prediction and queuing of queries. At the job level, we focus on detecting query semantics and predicting the query completion time through an online multivariate linear regression model, thereby increasing job-level parallelism and maximizing data sharing across jobs. At the task level, we focus on balanced data distribution, maximal slot utilization, and optimal data locality in task scheduling. Our experimental results on a set of complex query benchmarks demonstrate that our scheduling framework can significantly improve both the fairness and the throughput of Hive queries. It can improve query response time by up to 43.9% and 72.8% on average, compared to the Hadoop Fair Scheduler and the Hadoop Capacity Scheduler, respectively. In addition, our two-level scheduler can achieve a query fairness that is, on average, 59.8% better than that of the Hadoop Fair Scheduler.
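As a rough illustration of the online multivariate linear regression mentioned in the abstract above, the following Python sketch fits a least-squares model incrementally from running sums, so each finished job updates the model without storing past observations. The abstract does not specify the features or the update rule; the two features used here (input size and number of reduce tasks), the class name, and the sample numbers are all hypothetical.

```python
class OnlineLinearRegression:
    """Least-squares fit maintained from running sums of X^T X and X^T y."""

    def __init__(self, n_features):
        d = n_features + 1  # one extra column for the intercept
        self.xtx = [[0.0] * d for _ in range(d)]
        self.xty = [0.0] * d

    def update(self, features, runtime):
        """Fold one observed (job features, measured runtime) pair into the sums."""
        x = [1.0] + list(features)
        for i, xi in enumerate(x):
            self.xty[i] += xi * runtime
            for j, xj in enumerate(x):
                self.xtx[i][j] += xi * xj

    def predict(self, features):
        """Solve the normal equations and predict the runtime of a new job."""
        weights = solve(self.xtx, self.xty)
        x = [1.0] + list(features)
        return sum(w * xi for w, xi in zip(weights, x))


def solve(a, b):
    """Solve a * w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("not enough independent observations yet")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - tail) / m[i][i]
    return w


# Hypothetical history: (input size in GB, reduce tasks) -> runtime in seconds,
# generated from runtime = 5 + 2 * size + 3 * reducers for illustration only.
model = OnlineLinearRegression(n_features=2)
for feats, secs in [((1, 1), 10), ((2, 1), 12), ((1, 3), 16), ((4, 2), 19), ((3, 5), 26)]:
    model.update(feats, secs)
predicted = model.predict((2, 2))
```

Keeping only the running sums makes each update cost independent of how many jobs have been seen, at the price of re-solving a small linear system at prediction time.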

Articles published on MapReduce Jobs

MR-BIRCH: A scalable MapReduce-based BIRCH clustering algorithm

LSTPD: Least Slack Time-Based Preemptive Deadline Constraint Scheduler for Hadoop Clusters

Performance tuning analysis of spatial operations on Spatial Hadoop cluster with SSD

Indian Premier League Dataset Analytics using Hadoop-Hive

Scheduling MapReduce Jobs on Identical and Unrelated Processors

Improving MapReduce scheduler for heterogeneous workloads in a heterogeneous environment

Maximizing MapReduce job speed and reliability in the mobile cloud by optimizing task allocation

Comparison of MongoDB and Cassandra Databases for Spectrum Monitoring As-a-Service

Dynamic ranking-based MapReduce job scheduler to exploit heterogeneous performance in a virtualized environment

Automatic Parallel Detection of Neovascularization from Retinal Images Using Ensemble of Extreme Learning Machine.

Hadoop MapReduce Job Scheduling Algorithms Survey and Use Cases

Keddah

An Efficient MapReduce-Based Parallel Processing Framework for User-Based Collaborative Filtering

A Review on Storage and Large-Scale Processing of Data-Sets Using Map Reduce, YARN, SPARK, AVRO, MongoDB

Deadline-Aware MapReduce Job Scheduling with Dynamic Resource Availability

Distributed parallel deep learning of Hierarchical Extreme Learning Machine for multimode quality prediction with big process data

McTAR: A Multi-Trigger Checkpointing Tactic for Fast Task Recovery in MapReduce

A framework and a performance assessment for serverless MapReduce on AWS Lambda

Multivariate modeling and two-level scheduling of analytic queries

Cross-Cloud MapReduce for Big Data
