Abstract
Sampling-based approximate query processing (AQP) lets users trade accuracy for response time by executing a query on a sample of the data rather than on the whole dataset. There are two major AQP methods: (1) central limit theorem (CLT)-based online aggregation and (2) the bootstrap method. The former is very efficient but is suitable only for simple aggregation queries, while the latter is quite general but has relatively high computational overhead. Both methods can suffer from estimation failure. However, no existing technique supports both simple and complex queries within an acceptable time while also carefully accounting for estimation failure. To make current AQP methods more general and efficient, we propose a hybrid approximate query framework called AQP++ that combines the advantages of both methods and eliminates their limitations as far as possible. Within this hybrid framework, we present an estimation-parameter adjustment method for CLT-based online aggregation that improves its usability for more complex aggregation queries. We then propose an execution cost model that describes the computational overhead of the two AQP methods; this model supports the dynamic scheduling mechanism of AQP++ and makes the whole system more efficient and flexible. Moreover, we have implemented an AQP++ prototype and conducted extensive experiments on the TPC-H benchmark with skewed data distributions. Our results demonstrate that AQP++ can produce acceptable approximate results for both simple and complex queries in a much shorter time than the original CLT-based online aggregation and bootstrap methods.
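To make the contrast between the two estimation methods concrete, the following is a minimal sketch and not the AQP++ implementation. It assumes a synthetic, exponentially distributed (skewed) column, and the function names `clt_interval` and `bootstrap_interval`, the sample size, and the number of resamples are all hypothetical choices made for illustration. The CLT interval is a closed-form computation for a simple aggregate (the mean), whereas the bootstrap re-evaluates an arbitrary estimator (here the median) on many resamples, which is where its extra cost comes from.

```python
import math
import random
import statistics


def clt_interval(sample, z=1.96):
    """CLT-based ~95% confidence interval for the population mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error of the mean; cheap, but only valid for mean-like aggregates.
    half_width = z * statistics.stdev(sample) / math.sqrt(n)
    return mean - half_width, mean + half_width


def bootstrap_interval(sample, estimator, resamples=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for an arbitrary estimator."""
    rng = random.Random(seed)
    n = len(sample)
    # Re-run the estimator on `resamples` samples drawn with replacement.
    estimates = sorted(
        estimator(rng.choices(sample, k=n)) for _ in range(resamples)
    )
    lo = estimates[int((alpha / 2) * resamples)]
    hi = estimates[int((1 - alpha / 2) * resamples) - 1]
    return lo, hi


# Hypothetical skewed data standing in for a table column.
rng = random.Random(42)
population = [rng.expovariate(1.0) for _ in range(100_000)]
sample = rng.sample(population, 1_000)

clt_lo, clt_hi = clt_interval(sample)                            # simple aggregate
boot_lo, boot_hi = bootstrap_interval(sample, statistics.median)  # general estimator

print(f"CLT interval for the mean:        [{clt_lo:.3f}, {clt_hi:.3f}]")
print(f"Bootstrap interval for the median: [{boot_lo:.3f}, {boot_hi:.3f}]")
```

In this sketch the bootstrap evaluates the estimator 200 times per query, while the CLT path makes a single pass over the sample; a hybrid scheduler of the kind the abstract describes would weigh that cost difference against the generality each method offers.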