Abstract

In the past twenty years, we have witnessed an unprecedented production of data worldwide that has generated a growing demand for computing resources and has stimulated the design of computing paradigms and software tools to efficiently and quickly obtain insights from such Big Data. State-of-the-art parallel computing techniques such as MapReduce guarantee high performance in scenarios where the computing nodes involved are equally sized and clustered via broadband network links, and the data are co-located with the cluster of nodes. Unfortunately, these techniques have proven ineffective in geographically distributed scenarios, i.e., computing contexts where nodes and data are spread across multiple distant data centers. In the literature, researchers have proposed variants of the MapReduce paradigm that are aware of the constraints imposed by those scenarios (such as the imbalance in the nodes' computing power and in the interconnecting links) and enforce smart task scheduling strategies. We have designed a hierarchical computing framework in which a context-aware scheduler orchestrates computing tasks that leverage the potential of the vanilla Hadoop framework within each data center taking part in the computation. In this work, after presenting the features of the developed framework, we advocate fragmenting the data in a smart way so that the scheduler produces a fairer distribution of the workload among the computing tasks. To prove the concept, we implemented a software prototype of the framework and ran several experiments on a small-scale testbed. Test results are discussed in the last part of the paper.
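
To make the fragmentation idea concrete, here is a minimal, hypothetical sketch of capacity-proportional fragmentation: each data center receives a fragment of the input whose size is proportional to that center's estimated end-to-end throughput, so that all sub-jobs finish at roughly the same time. The names used here (FragmentPlanner, DataCenterProfile, throughputMBps) and the throughput figures are illustrative assumptions, not the framework's actual API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Minimal sketch of capacity-proportional data fragmentation.
 * All identifiers are illustrative assumptions, not the paper's API.
 */
public class FragmentPlanner {

    /** Estimated end-to-end throughput of one data center (MB/s). */
    record DataCenterProfile(String name, double throughputMBps) {}

    /**
     * Splits totalMB of input into one fragment per data center,
     * sized proportionally to that center's estimated throughput.
     */
    static Map<String, Long> fragment(long totalMB, DataCenterProfile[] centers) {
        double totalThroughput = 0;
        for (DataCenterProfile dc : centers) totalThroughput += dc.throughputMBps();

        Map<String, Long> plan = new LinkedHashMap<>();
        long assigned = 0;
        for (int i = 0; i < centers.length; i++) {
            long share = (i == centers.length - 1)
                    ? totalMB - assigned  // give the rounding remainder to the last center
                    : Math.round(totalMB * centers[i].throughputMBps() / totalThroughput);
            plan.put(centers[i].name(), share);
            assigned += share;
        }
        return plan;
    }

    public static void main(String[] args) {
        DataCenterProfile[] centers = {
            new DataCenterProfile("dc-eu", 80.0),   // fast nodes, fast link
            new DataCenterProfile("dc-us", 40.0),
            new DataCenterProfile("dc-asia", 20.0)  // slow nodes or narrow link
        };
        // 1400 MB of input -> expected split: 800 / 400 / 200 MB
        fragment(1400, centers).forEach((name, mb) ->
                System.out.println(name + " -> " + mb + " MB"));
    }
}
```

Running the example splits 1400 MB into 800, 400 and 200 MB fragments, mirroring the 80:40:20 throughput ratio, which is the kind of balanced workload the scheduler aims for.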

Highlights

  • In recent years, the continuous increase in data generated and captured by organizations for very different purposes has attracted great attention from both academia and industry

  • The final consideration concerns network imbalance: although it was tested only on a small testbed, the H2F framework proved capable of dealing with network links of differing properties, as it managed to take network parameters into account for both data distribution and job scheduling purposes

  • Researchers have been following two main approaches: (a) increasing the awareness of the heterogeneity of computing nodes and network links in improved versions of Hadoop (Geo-Hadoop approach); (b) adopting hierarchical frameworks in which a single MapReduce job is split into many sub-jobs that are first sent to several nodes, where they are processed as plain Hadoop jobs, and whose results are then sent back to a coordinator that merges them (Hierarchical approach; a toy sketch of this pattern follows this list)
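
The hierarchical approach in point (b) can be illustrated with a toy fan-out/fan-in skeleton: each sub-job runs concurrently (a stand-in for executing a plain Hadoop job inside one data center) and a coordinator merges the partial results. All names here (HierarchicalCoordinator, SubJob, runAsPlainHadoopJob) are illustrative assumptions, not code from the surveyed frameworks.

```java
import java.util.List;
import java.util.concurrent.*;
import java.util.stream.Collectors;

/**
 * Toy skeleton of the hierarchical approach: a top-level job is split
 * into per-data-center sub-jobs, each executed independently, and a
 * coordinator merges the partial results. Names are illustrative.
 */
public class HierarchicalCoordinator {

    /** Placeholder for the work shipped to one data center. */
    record SubJob(String dataCenter, String inputFragment) {}

    /** Stand-in for running the fragment as a vanilla Hadoop job remotely. */
    static String runAsPlainHadoopJob(SubJob job) {
        return "partial-result(" + job.dataCenter() + ")";
    }

    public static void main(String[] args) throws Exception {
        List<SubJob> subJobs = List.of(
                new SubJob("dc-eu", "fragment-0"),
                new SubJob("dc-us", "fragment-1"),
                new SubJob("dc-asia", "fragment-2"));

        ExecutorService pool = Executors.newFixedThreadPool(subJobs.size());
        try {
            // Fan out: one concurrent sub-job per data center.
            List<Future<String>> futures = subJobs.stream()
                    .map(j -> pool.submit(() -> runAsPlainHadoopJob(j)))
                    .collect(Collectors.toList());

            // Fan in: the coordinator waits for and merges the partials.
            StringBuilder merged = new StringBuilder("merged[");
            for (Future<String> f : futures) merged.append(' ').append(f.get());
            System.out.println(merged.append(" ]"));
        } finally {
            pool.shutdown();
        }
    }
}
```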


Summary

Introduction

The continuous increase in data generated and captured by organizations for very different purposes has attracted great attention from both academia and industry. While much effort has been put into devising effective solutions for crunching Big Data in a single, yet powerful, cluster of nodes [4], there has lately been a growing demand for processing and analyzing data that are generated and stored across geo-distributed data centers, also to meet the emerging challenges of green and sustainable computing [5]. Such a scenario calls for innovative and smarter computing frameworks capable of coping with a tough environment characterized by medium-to-large data volumes spread across multiple heterogeneous data centers connected to each other via geographic network links.

Background and Motivating Scenario
System Architecture
Results
Job Scheduling
Modeling Job’s Execution Paths
Optimal Data Fragmentation
Experiment
Concluding Remarks on Performance Results
Literature Review
Geo-Hadoop Approach
Hierarchical Approach
Comparative Analysis
Findings
Conclusions and Final Remarks