Traditional MapReduce frameworks were designed to process data within a single cluster and are ill-suited to geo-distributed data. Alternative approaches such as Hierarchical and Geo-Hadoop have been proposed to address this limitation, but they still struggle to manage inter-cluster data transfer efficiently, particularly given the heterogeneity of clusters and the varying bandwidth among them. Moreover, transmitting results to a central global reducer adds unnecessary complexity to geo-distributed MapReduce operations. To tackle these issues, we introduce Extended Cross-MapReduce (ECMR), a framework that accounts for resource heterogeneity and network links in geo-distributed MapReduce workflows. ECMR optimizes data management and determines the data volume needed to generate the final results. To improve performance, ECMR overlaps data transfer with execution by using multiple global reducers and by grouping the temporary results that must be transferred over the Internet. We propose a bipartite graph and extend the Gale-Shapley algorithm to determine the optimal number of clusters and to select the most suitable locations for the global reducers. Extensive experimental evaluations on a real testbed demonstrate the effectiveness of ECMR: compared with the Hierarchical and Geo-Hadoop approaches, it reduces overall makespan by up to 81% and 85%, respectively.
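For readers unfamiliar with the matching step mentioned above, the following is a minimal sketch of the classic (unextended) Gale-Shapley algorithm on a bipartite graph. It is not ECMR's extended variant, which the abstract does not specify. The names `gale_shapley`, `group_prefs`, and `site_prefs`, as well as the idea of matching groups of temporary results (`g1`, `g2`, `g3`) to candidate reducer sites (`A`, `B`, `C`), are illustrative assumptions; in practice the preference lists might be derived from estimated transfer and execution costs.

```python
from collections import deque


def gale_shapley(proposer_prefs, receiver_prefs):
    """Classic one-to-one stable matching; proposers propose in preference order.

    Both arguments map a node to its ordered preference list (best first).
    Returns a dict mapping each matched proposer to its receiver.
    """
    # rank[r][p]: position of proposer p in receiver r's list (lower is better).
    rank = {
        r: {p: i for i, p in enumerate(prefs)}
        for r, prefs in receiver_prefs.items()
    }
    next_choice = {p: 0 for p in proposer_prefs}  # next list index to try
    engaged_to = {}                               # receiver -> proposer
    free = deque(proposer_prefs)

    while free:
        p = free.popleft()
        if next_choice[p] >= len(proposer_prefs[p]):
            continue  # p has exhausted its list and stays unmatched
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = engaged_to.get(r)
        if current is None:
            engaged_to[r] = p            # r was free: accept
        elif rank[r][p] < rank[r][current]:
            engaged_to[r] = p            # r prefers p: swap partners
            free.append(current)
        else:
            free.append(p)               # r rejects p: p tries its next choice

    return {p: r for r, p in engaged_to.items()}


# Hypothetical preferences: result groups rank sites, and sites rank groups.
group_prefs = {
    "g1": ["A", "B", "C"],
    "g2": ["A", "C", "B"],
    "g3": ["B", "A", "C"],
}
site_prefs = {
    "A": ["g2", "g1", "g3"],
    "B": ["g1", "g3", "g2"],
    "C": ["g1", "g2", "g3"],
}

matching = gale_shapley(group_prefs, site_prefs)
print(matching)
```

The resulting matching is stable in the usual sense: no group and site both prefer each other over their assigned partners. Any extension that chooses the number of clusters or allows several groups per reducer would need additional logic beyond this sketch.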