Abstract
In recent years, big data analytics frameworks have proliferated rapidly. Meanwhile, it has become routine for large volumes of data to be generated, stored, and processed across geographically distributed data centers. In such a geo-distributed environment, network congestion caused by data transfers between networks becomes a major bottleneck for overall system performance. Many existing methods handle network congestion only after it occurs, which does not solve the problem fundamentally. In this paper, we focus on predicting and avoiding network congestion in advance for Apache Spark jobs running in a geo-distributed environment, with the goal of reducing job completion times. We formulate this as a runtime minimization problem, which is challenging to solve in practice because the setting spans multiple data centers. To address these challenges, we propose a model based on congestion-aware scheduling. In the model, we use software-defined networking (SDN) to detect in advance the size of the data flows coming from different data centers and to analyze their characteristics. This lets us predict which flows are likely to cause network congestion, so that we can apply one of two schemes depending on the flow. In addition, when network congestion is detected, we choose a path with greater bandwidth for the congested flow. The approach can minimize network congestion, improve network utilization, and improve system performance in a geo-distributed environment. As a highlight of this paper, we design and implement the proposed solution as a job scheduler based on Apache Spark, a modern data processing framework.
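The path-selection idea described above can be illustrated with a minimal Python sketch. It is not the paper's scheduler: the abstract does not state how congestion is predicted, so the rule used here (a flow is flagged when its estimated transfer time on the default path exceeds a time budget) is an assumption, and the names `Flow`, `Path`, `likely_to_congest`, and `choose_path` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    flow_id: str
    size_bytes: int  # flow size, assumed to be known in advance (e.g. via SDN)


@dataclass(frozen=True)
class Path:
    name: str
    available_bps: float  # currently available bandwidth in bits per second


def likely_to_congest(flow: Flow, path: Path, budget_s: float) -> bool:
    """Assumed rule: flag the flow if its estimated transfer time exceeds the budget."""
    estimated_s = flow.size_bytes * 8 / path.available_bps
    return estimated_s > budget_s


def choose_path(flow: Flow, default: Path, alternatives: list[Path], budget_s: float) -> Path:
    """Keep the default path for harmless flows; otherwise pick the widest path."""
    if not likely_to_congest(flow, default, budget_s):
        return default
    return max([default, *alternatives], key=lambda p: p.available_bps)


if __name__ == "__main__":
    default = Path("wan-1", 100e6)
    others = [Path("wan-2", 1e9), Path("wan-3", 50e6)]
    for flow in (Flow("small", 1_000_000), Flow("large", 1_000_000_000)):
        print(flow.flow_id, choose_path(flow, default, others, budget_s=1.0).name)
```

In this sketch the two branches of `choose_path` loosely stand in for the two schemes for different flows; a real scheduler would obtain flow sizes and link bandwidths from the SDN controller rather than from hard-coded values.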