Geo-distributed big data processing is growing steadily, driven both by data sources that are spread across different countries, with datacenters (DCs) located around the globe, and by applications that use multiple sites to increase reliability, security, and processing performance. Popular frameworks such as Hadoop and Spark have been redesigned to process geographically distributed data at its location. However, these methods still transfer a large amount of data over the Internet, which incurs high processing time and cost for many applications, and in several cases the output of the computation is smaller than its input. In this paper, we keep the data locality principle for processing data at different locations but drop the principle of transferring the entire intermediate results to a single global reducer. We propose Geo-MR, an intelligent geo-distributed MapReduce-based framework across a federated cloud, based on two heuristic algorithms: (i) GResearch, which chooses the best clusters as global reducers to reduce communication and optimize bandwidth use during transfers; and (ii) Geo-MR, which ensures that only the relevant data is scheduled to the selected global reducers that compute the final results. As a baseline, we propose an exact MapReduce scheduling model against which the results of the Geo-MR heuristic are benchmarked and discussed. The experimental results show that Geo-MR can improve the utilization of cloud federation resources (bandwidth and cluster VMs) and consequently reduce cost and job response time.
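As a rough illustration of bandwidth-aware global reducer selection, the following minimal sketch picks the cluster whose slowest inbound transfer of intermediate data is shortest. It is not the GResearch or Geo-MR heuristic itself, whose details are not given here; the function name `pick_global_reducer`, the cluster names, the data sizes, and the link bandwidths are all hypothetical, and the sketch assumes transfers to the reducer run in parallel and selects a single reducer only.

```python
from typing import Dict, Optional, Tuple


def pick_global_reducer(
    intermediate_mb: Dict[str, float],
    bandwidth_mbps: Dict[Tuple[str, str], float],
) -> Tuple[Optional[str], float]:
    """Return (cluster, estimated_seconds) for the cheapest reducer site.

    Hypothetical cost model: every other cluster ships its intermediate
    data to the candidate in parallel, so the cost of a candidate is the
    duration of its slowest inbound transfer.
    """
    best_cluster, best_cost = None, float("inf")
    for candidate in intermediate_mb:
        # Megabytes * 8 = megabits; divide by link speed in Mbit/s.
        cost = max(
            (
                size * 8 / bandwidth_mbps[(src, candidate)]
                for src, size in intermediate_mb.items()
                if src != candidate
            ),
            default=0.0,
        )
        if cost < best_cost:
            best_cluster, best_cost = candidate, cost
    return best_cluster, best_cost


# Hypothetical federation: intermediate output per cluster (MB)
# and symmetric inter-cluster links (Mbit/s).
sizes = {"eu": 800.0, "us": 200.0, "asia": 100.0}
links = {("eu", "us"): 100.0, ("eu", "asia"): 50.0, ("us", "asia"): 80.0}
bandwidth = {**links, **{(b, a): v for (a, b), v in links.items()}}

print(pick_global_reducer(sizes, bandwidth))  # ('eu', 16.0)
```

In this toy setting the cluster holding the most intermediate data becomes the reducer, since keeping its 800 MB local avoids the most expensive transfer.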