Query Completion Time Research Articles

In data centers, the occurrence of timeout for TCP may hurt its data transmission performance dramatically, causing problems like TCP Incast, TCP Outcast and long query completion time. To mitigate timeouts, the transport protocol should try to maintain a small switch queue to avoid the packet loss and recover lost packets quickly. Recent work suggests using Explicit Congestion Notification (ECN), Round Trip Time (RTT) or the in-network signal to achieve that. However, these solutions either still suffer from many timeouts when the number of concurrent flows becomes larger or require the nontrivial hardware support. The limitations motivate us to design a Priority-based Transport Control Protocol termed PTCP to mitigate timeouts as far as possible for commodity data center. The key idea of PTCP is that it inserts a high priority packet following each window of data packets. The key insight is that since the priorities of data packets and the inserted packet are different, they may arrive at the receiver in different sequences depending on the network congestion. By checking the sequences of the received ACKs of the two kinds of packets, PTCP can infer the network congestion to guide the fine adjustment of its sending window such that the switch buffer occupation is kept small. Additionally, by keeping the high priority packet always in flight, PTCP could determine to retransmit the possible lost data packets quickly. With the two mechanisms, PTCP significantly alleviates timeouts even when the number of concurrent flows becomes large. Furthermore, PTCP only requires the priority queuing function of switch, which is available in existing commodity switch hardware. Thus, it does not require the hardware modification. Extensive performance evaluation is conducted to demonstrate that PTCP has zero timeout and better performance for problems like TCP Incast, TCP Outcast and long query completion time compared with several state-of-the-art protocols.

Read full abstract

Analytic queries are typically compiled into execution plans in the form of directed acyclic graphs (DAGs) of MapReduce jobs. Jobs in the DAGs are dispatched to the MapReduce processing engine as soon as their dependencies are satisfied. MapReduce adopts a job-level scheduling policy to strive for a balanced distribution of tasks and effective utilization of resources. However, such simplistic policy is unable to reconcile the dynamics of different jobs in complex analytic queries, resulting in unfair treatment of different queries, low utilization of system resources, prolonged execution time, and low query throughput. Therefore, we introduce a scheduling framework to address these problems systematically. Our framework includes two techniques: multivariate DAG modeling and two-level query scheduling. Cross-layer semantics percolation allows the flow of query semantics and job dependencies in the DAG to the MapReduce scheduler. With richer semantics information, we build a multivariate model that can accurately predict the execution time of individual MapReduce jobs and gauge the changing size of analytics datasets through selectivity approximation. Furthermore, we introduce two-level query scheduling that can maximize the intra-query job-level concurrency, and at the same time speed up the query-level completion time based on the accurate prediction and queuing of queries. At the job level, we focus on detecting query semantics, predicting the query completion time through an online multivariate linear regression model, thereby increasing job-level parallelism and maximizing data sharing across jobs. At the task level, we focus on balanced data distribution, maximal slot utilization, and optimal data locality of task scheduling. Our experimental results on a set of complex query benchmarks demonstrate that our scheduling framework can significantly improve both fairness and throughput of Hive queries. It can improve query response time by up to 43.9% and 72.8% on average, compared to the Hadoop Fair Scheduling and the Hadoop Capacity Scheduling, respectively. In addition, our two-level scheduler can achieve a query fairness that is, on average, 59.8% better than that of the Hadoop Fair Scheduler.

Read full abstract

Query Completion Time Research Articles

Related Topics

Articles published on Query Completion Time

Practical Packet Deflection in Datacenters

Turbo: Dynamic and Decentralized Global Analytics via Machine Learning

Observing and Mitigating Micro-Burst Traffic in Data Center Networks

PTCP: A priority-based transport control protocol for timeout mitigation in commodity data center

Multivariate modeling and two-level scheduling of analytic queries

SLA Definition for Multi-tenant DBMS and its Impact on Query Optimization

Evaluating visual query methods for articulated motion video search

Datacenter Applications in Virtualized Networks: A Cross-Layer Performance Study

Don't drop, detour!

Optimization of sub-query processing in distributed data integration systems

ROAR

Site and query scheduling policies in multicomputer database systems

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

Query Completion Time Research Articles

Related Topics

Articles published on Query Completion Time

Practical Packet Deflection in Datacenters

Turbo: Dynamic and Decentralized Global Analytics via Machine Learning

Observing and Mitigating Micro-Burst Traffic in Data Center Networks

PTCP: A priority-based transport control protocol for timeout mitigation in commodity data center

Multivariate modeling and two-level scheduling of analytic queries

SLA Definition for Multi-tenant DBMS and its Impact on Query Optimization

Evaluating visual query methods for articulated motion video search

Datacenter Applications in Virtualized Networks: A Cross-Layer Performance Study

Don't drop, detour!

Optimization of sub-query processing in distributed data integration systems

ROAR

Site and query scheduling policies in multicomputer database systems