Network Cost-Aware Geo-Distributed Data Analytics System

Kwangsung Oh, Jon Weissman, Minmin Zhang, Abhishek Chandra

doi:10.1109/tpds.2021.3108893

https://doi.org/10.1109/tpds.2021.3108893

Abstract

Many geo-distributed data analytics (GDA) systems improve performance by targeting the network performance bottleneck: inter-data center network bandwidth. Unfortunately, these systems may encounter a cost-bottleneck ($) because they do not consider data transfer cost ($), one of the most expensive and heterogeneous resources in a multi-cloud environment. In this article, we present Kimchi, a network cost-aware GDA system that meets a desired cost-performance tradeoff by exploiting data transfer cost heterogeneity to avoid the cost-bottleneck. Kimchi makes cost-aware task placement decisions when scheduling tasks, given inputs including data transfer cost, network bandwidth, input data size and locations, and the desired cost-performance tradeoff preference. Kimchi also remains mindful of data transfer cost in the presence of dynamics. Kimchi has been applied to two common GDA MapReduce models: synchronous barrier and asynchronous push-based shuffle. A Kimchi prototype has been implemented on Spark, and experiments show that it reduces cost by 5% to 24% without impacting performance and reduces query execution time by 45% to 70% without impacting cost, compared to baseline approaches: centralized, vanilla Spark, and bandwidth-aware (e.g., Iridium). More importantly, Kimchi allows applications to explore a much richer cost-performance tradeoff space in a multi-cloud environment.
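
To make the inputs listed in the abstract concrete, the following is a minimal illustrative sketch of a cost-performance tradeoff in reduce-task placement. It is not Kimchi's algorithm, which the abstract does not specify. The names (choose_placement, cost_weight, steps), the brute-force search over a coarse grid of task fractions, the per-link bottleneck time model, and the normalized weighted objective are all assumptions made for illustration.

```python
from itertools import product


def transfer_cost_and_time(data_gb, fractions, price_per_gb, bandwidth_gb_s):
    """Return (dollar cost, slowest-link seconds) for one placement.

    Assumed model: site `src` ships data_gb[src] * fractions[dst] GB to
    every other site `dst`, and all links transfer in parallel.
    """
    n = len(data_gb)
    cost = 0.0
    slowest = 0.0
    for src in range(n):
        for dst in range(n):
            if src == dst:
                continue  # local data incurs no transfer
            moved = data_gb[src] * fractions[dst]
            cost += moved * price_per_gb[src][dst]
            slowest = max(slowest, moved / bandwidth_gb_s[src][dst])
    return cost, slowest


def candidate_fractions(n_sites, steps):
    """Yield task-fraction tuples on a grid of 1/steps that sum to 1."""
    for combo in product(range(steps + 1), repeat=n_sites):
        if sum(combo) == steps:
            yield tuple(c / steps for c in combo)


def choose_placement(data_gb, price_per_gb, bandwidth_gb_s, cost_weight, steps=4):
    """Pick the placement minimizing a weighted, normalized cost/time sum.

    cost_weight = 1.0 optimizes cost only; 0.0 optimizes time only
    (a hypothetical stand-in for a tradeoff preference).
    """
    scored = [
        (f, *transfer_cost_and_time(data_gb, f, price_per_gb, bandwidth_gb_s))
        for f in candidate_fractions(len(data_gb), steps)
    ]
    max_cost = max(c for _, c, _ in scored) or 1.0
    max_time = max(t for _, _, t in scored) or 1.0

    def objective(entry):
        _, c, t = entry
        return cost_weight * c / max_cost + (1.0 - cost_weight) * t / max_time

    return min(scored, key=objective)


if __name__ == "__main__":
    # Hypothetical two-site setup: the cheap link (0 -> 1) is also the slow one.
    data = [10.0, 10.0]                # GB of intermediate data per site
    price = [[0.0, 0.01], [0.10, 0.0]]  # $ per GB, price[src][dst]
    bw = [[1.0, 0.1], [1.0, 1.0]]       # GB/s, bw[src][dst]
    for w in (0.0, 0.5, 1.0):
        print(w, choose_placement(data, price, bw, cost_weight=w))
```

In this made-up setup the cheapest link is also the slowest, so a cost-only preference and a time-only preference select opposite sites, and intermediate weights fall between them. This is the kind of tradeoff space the abstract refers to.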
