Abstract
Modern enterprises often manage geographically distributed datacenters around the globe. In such an environment, datasets are naturally collected and stored in different datacenters and are later queried for complex analytics. In this paper, we study the Wide-Area Data Analytics problem, which aims to control data movement efficiently and achieve low latency for overall query processing, both constrained by limited and expensive network resources across datacenters. Previous papers focus on offline settings with single analytical queries and do not consider time when optimizing system performance; they therefore ignore the dynamics of data and task placement in terms of inter-DC bandwidth utilization. We consider the online setting and formulate a cost-minimizing optimization problem over time for processing arbitrary Directed Acyclic Graph (DAG) queries. Taking the dynamics of network resource usage into account, we develop two online algorithms, Online Switch Resist (OSR) and Most Fixed Horizon Control (MFHC), with good competitive ratios. We perform extensive simulations and comparative studies using the TPC-CH benchmark and verify the efficacy of the proposed algorithms. In these experiments, the proposed algorithms outperform the existing algorithm, and their performance approaches the theoretical optimum.
Highlights
Nowadays, many large organizations and enterprises produce massive volumes of data that is approaching scale
A DAG can visually depict the time course of data transmission. In such an environment, processing analytical queries in the form of a Directed Acyclic Graph (DAG) of operators brings a unique challenge: how to efficiently process the query DAGs to achieve customized Quality of Service (QoS) in the presence of network resource constraints? Since inter-DC network bandwidth is often limited and expensive, the QoS for analytical query processing is largely determined by how network resources are utilized.
To address the above challenges, we developed two online algorithms, Online Switch Resist (OSR) and Most Fixed Horizon Control (MFHC).
Summary
Many large organizations and enterprises produce massive volumes of data that is approaching scale (e.g., petabytes of user activity logs and server monitoring data per day). A DAG can visually depict the time course of data transmission. In such an environment, processing analytical queries in the form of a Directed Acyclic Graph (DAG) of operators brings a unique challenge: how to efficiently process the query DAGs to achieve customized Quality of Service (QoS) in the presence of network resource constraints? Since inter-DC network bandwidth is often limited and expensive, the QoS for analytical query processing is largely determined by how network resources are utilized. Such challenges define a new problem setting, referred to as Wide-Area Data Analytics (WADA) [3] or Global Analytics [4], which has recently become a research hotspot. One important observation in this work is that applying existing algorithms repeatedly may move data back and forth, wasting expensive inter-DC network bandwidth (see examples in Section III-B, after the relevant cost components are defined).
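The back-and-forth effect can be illustrated with a toy example. The sketch below is not the paper's OSR or MFHC algorithm, and its cost model is an assumption made only for illustration: two hypothetical datacenters "A" and "B", a per-step processing cost at each site, and a fixed `move_cost` charged whenever the data changes site. It compares a policy that re-optimizes each step in isolation with a simple policy that resists switching until the accumulated penalty of staying exceeds the cost of a move.

```python
def greedy_placement(step_costs, move_cost, start):
    """Re-optimize every step in isolation, ignoring what a move costs."""
    location, total, moves = start, 0.0, 0
    for costs in step_costs:
        best = min(costs, key=costs.get)   # cheapest site for this step only
        if best != location:
            total += move_cost             # pay to ship the data across DCs
            moves += 1
            location = best
        total += costs[location]
    return total, moves


def switch_resistant_placement(step_costs, move_cost, start):
    """Stay put until the accumulated penalty of staying exceeds move_cost.

    Hypothetical rule for illustration; not the paper's OSR algorithm.
    """
    location, total, moves = start, 0.0, 0
    penalty = 0.0                          # extra cost paid so far by not moving
    for costs in step_costs:
        best = min(costs, key=costs.get)
        penalty += costs[location] - costs[best]
        if best != location and penalty > move_cost:
            total += move_cost
            moves += 1
            location = best
            penalty = 0.0
        total += costs[location]
    return total, moves


# Assumed workload: the cheaper site alternates every step for six steps.
step_costs = [{"A": 1, "B": 2}, {"A": 2, "B": 1}] * 3
greedy = greedy_placement(step_costs, move_cost=5, start="A")
resistant = switch_resistant_placement(step_costs, move_cost=5, start="A")
print("greedy (total cost, moves):", greedy)
print("switch-resistant (total cost, moves):", resistant)
```

In this assumed workload the per-step optimizer moves the data at almost every step, so movement cost dominates its total, while the switch-resistant policy never finds a move worthwhile. The numbers depend entirely on the toy cost values chosen here.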