Abstract

In the past twenty years, we have witnessed an unprecedented production of data worldwide that has generated a growing demand for computing resources and has stimulated the design of computing paradigms and software tools to efficiently and quickly obtain insights from such Big Data. State-of-the-art parallel computing techniques such as MapReduce guarantee high performance in scenarios where the computing nodes involved are equally sized and clustered via broadband network links, and the data are co-located with the cluster of nodes. Unfortunately, these techniques have proven ineffective in geographically distributed scenarios, i.e., computing contexts where nodes and data are spread across multiple distant data centers. In the literature, researchers have proposed variants of the MapReduce paradigm that are aware of the constraints imposed by those scenarios (such as the imbalance in the nodes' computing power and in the interconnecting links) and enforce smart task scheduling strategies. We have designed a hierarchical computing framework in which a context-aware scheduler orchestrates computing tasks that leverage the potential of the vanilla Hadoop framework within each data center taking part in the computation. In this work, after presenting the features of the developed framework, we advocate fragmenting the data in a smart way so that the scheduler produces a fairer distribution of the workload among the computing tasks. To prove the concept, we implemented a software prototype of the framework and ran several experiments on a small-scale testbed. Test results are discussed in the last part of the paper.
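
To make the fragmentation idea concrete, here is a minimal, hypothetical sketch of capacity-proportional fragmentation: each data center receives a fragment of the input whose size is proportional to that center's estimated end-to-end throughput, so that all sub-jobs finish at roughly the same time. The names used here (FragmentPlanner, DataCenterProfile, throughputMBps) and the throughput figures are illustrative assumptions, not the framework's actual API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Minimal sketch of capacity-proportional data fragmentation.
 * All identifiers are illustrative assumptions, not the paper's API.
 */
public class FragmentPlanner {

    /** Estimated end-to-end throughput of one data center (MB/s). */
    record DataCenterProfile(String name, double throughputMBps) {}

    /**
     * Splits totalMB of input into one fragment per data center,
     * sized proportionally to that center's estimated throughput.
     */
    static Map<String, Long> fragment(long totalMB, DataCenterProfile[] centers) {
        double totalThroughput = 0;
        for (DataCenterProfile dc : centers) totalThroughput += dc.throughputMBps();

        Map<String, Long> plan = new LinkedHashMap<>();
        long assigned = 0;
        for (int i = 0; i < centers.length; i++) {
            long share = (i == centers.length - 1)
                    ? totalMB - assigned  // give the rounding remainder to the last center
                    : Math.round(totalMB * centers[i].throughputMBps() / totalThroughput);
            plan.put(centers[i].name(), share);
            assigned += share;
        }
        return plan;
    }

    public static void main(String[] args) {
        DataCenterProfile[] centers = {
            new DataCenterProfile("dc-eu", 80.0),   // fast nodes, fast link
            new DataCenterProfile("dc-us", 40.0),
            new DataCenterProfile("dc-asia", 20.0)  // slow nodes or narrow link
        };
        // 1400 MB of input -> expected split: 800 / 400 / 200 MB
        fragment(1400, centers).forEach((name, mb) ->
                System.out.println(name + " -> " + mb + " MB"));
    }
}
```

Running the example splits 1400 MB into 800, 400 and 200 MB fragments, mirroring the 80:40:20 throughput ratio, which is the kind of balanced workload the scheduler aims for.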

Highlights

  • In recent years, the continuous increase in data generated and captured by organizations for very different purposes has attracted great attention from both academia and industry

  • The final consideration concerns network imbalance: although it was tested only on a small testbed, the H2F framework proved capable of dealing with network links of differing properties, as it managed to take network parameters into account for both data distribution and job scheduling purposes

  • Researchers have been following two main approaches: (a) increasing the awareness of the heterogeneity of computing nodes and network links in improved versions of Hadoop (Geo-Hadoop approach); (b) adopting hierarchical frameworks in which a single MapReduce job is split into many sub-jobs that are first sent to several nodes, where they are processed as plain Hadoop jobs, and whose results are then sent back to a coordinator that merges them (Hierarchical approach; a toy sketch of this pattern follows this list)
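
The hierarchical approach in point (b) can be illustrated with a toy fan-out/fan-in skeleton: each sub-job runs concurrently (a stand-in for executing a plain Hadoop job inside one data center) and a coordinator merges the partial results. All names here (HierarchicalCoordinator, SubJob, runAsPlainHadoopJob) are illustrative assumptions, not code from the surveyed frameworks.

```java
import java.util.List;
import java.util.concurrent.*;
import java.util.stream.Collectors;

/**
 * Toy skeleton of the hierarchical approach: a top-level job is split
 * into per-data-center sub-jobs, each executed independently, and a
 * coordinator merges the partial results. Names are illustrative.
 */
public class HierarchicalCoordinator {

    /** Placeholder for the work shipped to one data center. */
    record SubJob(String dataCenter, String inputFragment) {}

    /** Stand-in for running the fragment as a vanilla Hadoop job remotely. */
    static String runAsPlainHadoopJob(SubJob job) {
        return "partial-result(" + job.dataCenter() + ")";
    }

    public static void main(String[] args) throws Exception {
        List<SubJob> subJobs = List.of(
                new SubJob("dc-eu", "fragment-0"),
                new SubJob("dc-us", "fragment-1"),
                new SubJob("dc-asia", "fragment-2"));

        ExecutorService pool = Executors.newFixedThreadPool(subJobs.size());
        try {
            // Fan out: one concurrent sub-job per data center.
            List<Future<String>> futures = subJobs.stream()
                    .map(j -> pool.submit(() -> runAsPlainHadoopJob(j)))
                    .collect(Collectors.toList());

            // Fan in: the coordinator waits for and merges the partials.
            StringBuilder merged = new StringBuilder("merged[");
            for (Future<String> f : futures) merged.append(' ').append(f.get());
            System.out.println(merged.append(" ]"));
        } finally {
            pool.shutdown();
        }
    }
}
```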


Summary

Introduction

The continuous increase in data generated and captured by organizations for very different purposes has attracted great attention from both academia and industry. While much effort has been put into devising effective solutions for crunching Big Data in a single, yet powerful, cluster of nodes [4], there has lately been a growing demand for processing and analyzing data that are generated and stored across geo-distributed data centers, also to meet the emerging challenges of green and sustainable computing [5]. Such a scenario calls for innovative and smarter computing frameworks capable of coping with a tough environment characterized by medium-to-large data volumes spread across multiple heterogeneous data centers connected to each other via geographic network links.

Background and Motivating Scenario
System Architecture
Results
Job Scheduling
Modeling Job’s Execution Paths
Optimal Data Fragmentation
Experiment
Concluding Remarks on Performance Results
Literature Review
Geo-Hadoop Approach
Hierarchical Approach
Comparative Analysis
Findings
Conclusions and Final Remarks