Hadoop YARN Research Articles

In this paper, we consider a deadline-constrained MapReduce (MR) scheduling problem of minimizing energy consumption in Hadoop’s generic resource manager, Yet Another Resource Negotiator (YARN). The problem is modeled as an integer programming (IP) problem using time-indexed decision variables. We propose two solution approaches. First, we give a heuristic algorithm that generates sub-optimal schedules in polynomial time. Second, we propose a novel constraint programming (CP) model, as an alternative to the IP model, which always generates optimal schedules when solved by a CP solver. CP is a relatively new alternative to IP-based branch-and-cut algorithms for exactly solving NP-hard optimization problems. We performed several experiments to compare the two approaches on real data traces of a wide variety of MR jobs from the HiBench and PUMA benchmark suites. For large-scale big data jobs, the heuristic algorithm provides sub-optimal results in a very small amount of time. The CP approach, on the other hand, not only gives optimal results but also takes less time than IP-based approaches, so it can be used in non-time-critical situations to obtain an optimal schedule. A few further experiments compared the tightest satisfiable deadline under both approaches and concluded that the CP technique can produce optimal schedules under tighter deadline constraints than the heuristic approach. Finally, we investigate how sensitive the total energy consumption of tasks and the execution time of each approach are to the number of tasks and to the deadlines.
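
For readers unfamiliar with time-indexed decision variables, the following is a generic textbook-style sketch of what such a model can look like. It is not the model from the paper: every symbol is an assumption introduced here for illustration. The binary variable x_{jmt} is 1 if task j starts on machine m at time slot t, p_{jm} and e_{jm} are the assumed processing time and energy of task j on machine m, c_m is the assumed number of containers on machine m, and D is the deadline. Precedence between map and reduce tasks is left out for brevity.

```latex
\begin{align}
\min \quad & \sum_{j \in J} \sum_{m \in M} \sum_{t=0}^{D - p_{jm}} e_{jm}\, x_{jmt} \\
\text{s.t.} \quad
 & \sum_{m \in M} \sum_{t=0}^{D - p_{jm}} x_{jmt} = 1
   && \forall j \in J \\
 & \sum_{j \in J} \; \sum_{s=\max(0,\, t - p_{jm} + 1)}^{\min(t,\, D - p_{jm})} x_{jms} \le c_m
   && \forall m \in M,\ t = 0, \dots, D-1 \\
 & x_{jmt} \in \{0, 1\}
   && \forall j \in J,\ m \in M,\ t = 0, \dots, D - p_{jm}
\end{align}
```

In this sketch the deadline is enforced by only allowing start times t ≤ D − p_{jm}, the first constraint starts each task exactly once, and the second limits the number of tasks running on a machine in any time slot to its container count.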

As large-scale data analytics becomes the norm in various industries, the use of MapReduce frameworks to analyze ever-increasing volumes of data will keep growing. In turn, this trend increases interest in moving MapReduce into multi-tenant clouds. However, the application performance of MapReduce can be significantly affected by the time-varying network bandwidth in a shared cluster. Although many recent studies improve MapReduce performance through dynamic scheduling that reduces shuffle traffic, most of them do not consider the impact of the hierarchical network architectures that are widespread in data centers. In this paper, we propose and design a Hierarchical topology (Hit) aware MapReduce scheduler to minimize overall data traffic cost and hence reduce job execution time. We first formulate the problem as a Topology Aware Assignment (TAA) optimization problem that considers dynamic computing and communication resources in a cloud with a hierarchical network architecture. We further develop a synergistic strategy to solve the TAA problem using stable matching theory, which respects the preferences of both individual tasks and hosting machines. Finally, we implement the proposed scheduler as a pluggable module on Hadoop YARN and evaluate its performance through testbed experiments and simulations. The testbed results show that Hit-scheduler improves job completion time by 28% and 11% compared to the Capacity Scheduler and the Probabilistic Network-Aware scheduler, respectively. Our simulations further demonstrate that Hit-scheduler reduces the traffic cost by up to 38% and the average shuffle flow traffic time by 32% compared to the Capacity Scheduler. In this manuscript, we extend Hit-scheduler to a decentralized heuristic scheme that performs policy-aware allocation in data center environments. Many existing centralized approximation approaches are too complex to implement in a data center, which typically includes large numbers of servers, containers, switches and traffic flows. In the extension, we design a decentralized heuristic scheme that performs Policy-Aware Task (PAT) allocation, using an existing centralized algorithm to approximately maximize the total gained utility. Finally, simulation-based experimental results show that the proposed PAT policy reduces the communication cost by 33.6% compared with the default scheduler in data centers.
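
The stable-matching idea mentioned in this abstract can be illustrated with a minimal sketch. This is not the Hit-scheduler algorithm itself; it is a plain task-proposing deferred-acceptance procedure (Gale-Shapley with host capacities), and the function name `stable_assign`, the preference lists, and the capacities are all hypothetical. In a real scheduler the preferences would presumably be derived from traffic cost and available resources.

```python
from collections import deque


def stable_assign(task_prefs, host_prefs, capacity):
    """Task-proposing deferred acceptance with per-host capacities.

    task_prefs: task -> list of hosts, most preferred first (hypothetical input)
    host_prefs: host -> list of tasks, most preferred first (hypothetical input)
    capacity:   host -> number of tasks it can hold
    Returns a dict mapping each assigned task to its host.
    """
    # rank[h][t] = position of task t in host h's list (lower is better)
    rank = {h: {t: i for i, t in enumerate(prefs)}
            for h, prefs in host_prefs.items()}
    next_choice = {t: 0 for t in task_prefs}  # next host index to propose to
    assigned = {h: [] for h in host_prefs}    # tasks currently held per host
    free = deque(task_prefs)                  # tasks still looking for a host

    while free:
        t = free.popleft()
        if next_choice[t] >= len(task_prefs[t]):
            continue  # list exhausted: the task stays unassigned
        h = task_prefs[t][next_choice[t]]
        next_choice[t] += 1
        if t not in rank[h]:
            free.append(t)  # host does not list this task; try the next host
            continue
        assigned[h].append(t)
        if len(assigned[h]) > capacity[h]:
            # Over capacity: the host evicts the task it likes least.
            worst = max(assigned[h], key=lambda x: rank[h][x])
            assigned[h].remove(worst)
            free.append(worst)

    return {t: h for h, tasks in assigned.items() for t in tasks}


# Toy example: three tasks, two hosts, h1 has one slot and h2 has two.
task_prefs = {"t1": ["h1", "h2"], "t2": ["h1", "h2"], "t3": ["h2", "h1"]}
host_prefs = {"h1": ["t2", "t1", "t3"], "h2": ["t1", "t3", "t2"]}
capacity = {"h1": 1, "h2": 2}
placement = stable_assign(task_prefs, host_prefs, capacity)
print(placement)
```

In the toy example, t1 and t2 both want h1, which has one slot and prefers t2, so t1 falls back to h2. The result is stable in the usual sense: no task and host would both prefer each other over their current assignment.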

Articles published on Hadoop YARN

A distributed data processing scheme based on Hadoop for synchrotron radiation experiments.

Serverless-like platform for container-based YARN clusters

New YARN sharing GPU based on graphics memory granularity scheduling

Performance Improvement through Novel Adaptive Node and Container Aware Scheduler with Resource Availability Control in Hadoop YARN

New efficient Hadoop scheduler: Generalized particle swarm optimization and simulated annealing‐dominant resource fairness

Comprehensive techniques for multi-tenant deep learning framework on a Hadoop YARN cluster

PAS: Performance-Aware Job Scheduling for Big Data Processing Systems

Analysis of MapReduce operation in Hadoop YARN and Rack-Aware Resource Management System for YARN

A Dynamic Scaling Approach in Hadoop YARN

New YARN Non-Exclusive Resource Management Scheme through Opportunistic Idle Resource Assignment

Constraint programming versus heuristic approach to MapReduce scheduling problem in Hadoop YARN for energy minimization

New Scheduling Algorithms for Improving Performance and Resource Utilization in Hadoop YARN Clusters

BigDataSDNSim: A simulator for analyzing big data applications in software‐defined cloud data centers

A heuristic method towards deadline-aware energy-efficient mapreduce scheduling problem in Hadoop YARN

Improvement of job completion time in data-intensive cloud computing applications

Joint Optimization of MapReduce Scheduling and Network Policy in Hierarchical Data Centers

A rack-aware scalable resource management system for Hadoop YARN

An Adaptive Efficiency-Fairness Meta-Scheduler for Data-Intensive Computing

JouleMR: Towards Cost-Effective and Green-Aware Data Processing Frameworks
