Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches
Methods for Approximate Query Processing (AQP) are essential for dealing with massive data. They are often the only means of providing interactive response times when exploring massive datasets, and are also needed to handle high-speed data streams. These methods proceed by computing a lossy, compact synopsis of the data, and then executing the query of interest against the synopsis rather than the entire dataset. We describe basic principles and recent developments in AQP. We focus on four key synopses: random samples, histograms, wavelets, and sketches. We consider issues such as accuracy, space and time efficiency, optimality, practicality, range of applicability, error bounds on query answers, and incremental maintenance. We also discuss the trade-offs among the different synopsis types.
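As a rough illustration of the random-sample synopsis this survey covers, a uniform one-pass sample can be maintained with reservoir sampling (a standard technique, not specific to the survey; all names below are illustrative):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Maintain a uniform random sample of size k over a one-pass stream
    (Vitter's Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randrange(i + 1)   # uniform index in [0, i]
            if j < k:
                sample[j] = item       # item kept with probability k/(i+1)
    return sample

# A small synopsis supports approximate aggregates over the full stream.
stream = range(100_000)
sample = reservoir_sample(stream, 1_000)
estimated_mean = sum(sample) / len(sample)   # true mean is 49_999.5
```

The synopsis occupies O(k) space regardless of stream length, which is what makes it usable for both massive datasets and data streams.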
- Conference Article
- 10.1109/cbd.2016.020
- Aug 1, 2016
Sampling-based approximate query processing (AQP) offers a fast way for users to trade accuracy for time by executing analytical queries on a sample of the data rather than the whole dataset. AQP is commonly adopted to support efficient Big Data analysis, and there are two major approaches: (1) online aggregation based on the central limit theorem (CLT), and (2) the bootstrap method. The first is time-efficient but limited to simple aggregation queries, while the second is general but carries high computational overhead. Both methods also suffer from possible estimation failure. To date, no technique both supports a broad range of queries and achieves acceptable execution time. To make AQP more general and efficient, we propose a hybrid approximate query framework called AQP++ that combines the advantages of both methods and eliminates their limitations as far as possible. We have implemented an AQP++ prototype and conducted extensive experiments on the TPC-H benchmark. Our results demonstrate the effectiveness and efficiency of AQP++.
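The CLT-based online aggregation the abstract refers to can be sketched for a SUM query with standard textbook formulas (illustrative only; names are ours, not the paper's):

```python
import math
import random

def clt_sum_estimate(n_total, sample, z=1.96):
    """Estimate SUM over a table of n_total rows from a uniform sample,
    with a CLT-based 95% confidence half-width (z = 1.96)."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)  # sample variance
    estimate = n_total * mean
    half_width = z * n_total * math.sqrt(var / n)
    return estimate, half_width

rng = random.Random(42)
table = [rng.uniform(0.0, 100.0) for _ in range(100_000)]
sample = rng.sample(table, 500)
estimate, half_width = clt_sum_estimate(len(table), sample)
true_sum = sum(table)
```

The normal approximation is what makes this cheap, and also why it is limited to aggregates with CLT-style estimators, which is the limitation the bootstrap avoids at higher cost.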
- Research Article
5
- 10.1016/j.jocs.2017.05.001
- May 29, 2017
- Journal of Computational Science
AQP++: a hybrid approximate query processing framework for generalized aggregation queries
- Conference Article
2
- 10.1109/wetice.2017.9
- Jun 1, 2017
Outsourcing the processing of data streams means that a service provider collects and stores data on behalf of a company that does not have enough resources to manage such data streams itself. If the company does not trust the service provider, it has to check the validity of the answers when querying the data store, since the results may not be reliable. Methods for approximate query processing can be used to evaluate the answers: they return fast answers based on data synopses, and these results can be used to validate those obtained from the provider on the basis of an accuracy estimate. In this paper, an extension of the traditional TPC-H benchmark is used to compare three methods for approximate query processing, in order to verify their performance and accuracy.
- Conference Article
68
- 10.1145/3183713.3183747
- May 27, 2018
Interactive analytics requires database systems to be able to answer aggregation queries within interactive response times. As the amount of data is continuously growing at an unprecedented rate, this is becoming increasingly challenging. In the past, the database community has proposed two separate ideas, sampling-based approximate query processing (AQP) and aggregate precomputation (AggPre) such as data cubes, to address this challenge. In this paper, we argue for the need to connect these two separate ideas for interactive analytics. We propose AQP++, a novel framework to enable the connection. The framework can leverage both a sample as well as a precomputed aggregate to answer user queries. We discuss the advantages of having such a unified framework and identify new challenges to fulfill this vision. We conduct an in-depth study of these challenges for range queries and explore both optimal and heuristic solutions to address them. Our experiments using two public benchmarks and one real-world dataset show that AQP++ achieves a more flexible and better trade-off among preprocessing cost, query response time, and answer quality than AQP or AggPre.
- Conference Article
115
- 10.1145/2882903.2915249
- Jun 14, 2016
Data volumes in decision-support systems are growing exponentially, making it challenging to ensure interactive response times for ad-hoc queries without increasing hardware cost. Aggregation queries with GROUP BY, which produce an aggregate value for every combination of values in the grouping columns, are the most important class of ad-hoc queries. As small errors are usually tolerable for such queries, approximate query processing (AQP) has the potential to answer them over very large datasets much faster. In many cases analysts require the distribution of (group, aggvalue) pairs in the estimated answer to be guaranteed within a certain error threshold of the exact distribution. Existing AQP techniques are inadequate for two main reasons. First, users cannot express such guarantees. Second, sampling techniques used in traditional AQP can produce arbitrarily large errors even for SUM queries. To address these limitations, we first introduce a new precision metric, called distribution precision, to express such error guarantees. We then study how to provide fast approximate answers to aggregation queries with distribution precision guaranteed within a user-specified error bound. The main challenges are to provide rigorous error guarantees and to handle arbitrary, highly selective predicates without maintaining large samples. We propose a novel sampling scheme called measure-biased sampling to address the former challenge. For the latter, we propose two new indexes to augment in-memory samples. Like other sampling-based AQP techniques, our solution supports any aggregate that can be estimated from random samples. In addition to deriving theoretical guarantees, we conduct an experimental study comparing our system with state-of-the-art AQP techniques and a commercial column-store database system on both synthetic and real enterprise datasets.
Our system provides a median speed-up of more than 100x with around 5% distribution error compared with the commercial database.
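The core intuition of measure-biased sampling can be sketched as follows: draw rows with probability proportional to the measure value, so a SUM over a predicate reduces to a hit fraction. This is a simplified, illustrative take under our own naming; the paper's actual scheme and its indexes are more involved:

```python
import random

def measure_biased_sample(rows, measure, n, seed=7):
    """Draw n rows with replacement, each with probability proportional
    to its measure value."""
    rng = random.Random(seed)
    weights = [measure(r) for r in rows]
    return rng.choices(rows, weights=weights, k=n), sum(weights)

def estimate_sum(sample, total_measure, predicate):
    """SUM(measure) over rows matching the predicate: under measure-biased
    sampling this is total_measure times the fraction of sampled hits."""
    hits = sum(1 for r in sample if predicate(r))
    return total_measure * hits / len(sample)

rows = [{"region": i % 4, "sales": (i % 10) + 1} for i in range(10_000)]
sample, total = measure_biased_sample(rows, lambda r: r["sales"], 2_000)
estimate = estimate_sum(sample, total, lambda r: r["region"] == 2)
true = sum(r["sales"] for r in rows if r["region"] == 2)   # 12_500
```

Because each draw's chance of hitting the predicate is exactly the predicate's share of the total measure, the estimator is unbiased for SUM, which is why the error stays bounded where uniform sampling can go arbitrarily wrong.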
- Conference Article
- 10.1109/pdp.2016.46
- Feb 1, 2016
Analysis of existing techniques for approximate query processing of Big Data based on sampling, histograms, and wavelets shows that wavelet-based methods can be effectively utilized for OLAP, owing to their advantages in handling multidimensional data and in querying single cells as well as aggregate values from a data warehouse. At the same time, current wavelet-based methods for approximate query processing have deficiencies that make them difficult to implement in practice. In particular, most techniques struggle with arbitrarily sized data, either restricting dimension lengths to powers of two or complicating the decomposition algorithms, which increases construction time and makes error estimation difficult. There is also a lack of wavelet-based approximate processing methods that provide a bounded error and a confidence interval for both single and aggregate values. Our contribution in this paper is a new wavelet method for approximate query processing that handles arbitrarily sized multidimensional datasets with minor extra computation and provides a bounded error on the reconstruction of single or aggregate values. We demonstrate that the new method can evaluate a confidence interval for the query error given a data-warehouse compression ratio, or perform the inverse task: evaluating the compression ratio required for a given allowable error. The method was applied and verified on real epidemiological datasets to support research on correlations and patterns in disease spread and clinical signs. The accuracy of the estimated error proved acceptable for retrieving single and aggregate values, and the query-time advantage depends on the compression ratio and the volume of the processed data.
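The power-of-two restriction the abstract criticizes comes from the classical Haar decomposition. A minimal sketch of that baseline (the standard textbook algorithm, not the paper's method) shows where the restriction enters: every level halves the vector by pairwise averaging, so the length must be a power of two:

```python
def haar_decompose(data):
    """Full Haar wavelet decomposition of a power-of-two-length vector:
    repeatedly store pairwise averages and detail (difference) coefficients."""
    coeffs = []
    cur = list(data)
    while len(cur) > 1:
        avgs = [(cur[i] + cur[i + 1]) / 2 for i in range(0, len(cur), 2)]
        details = [(cur[i] - cur[i + 1]) / 2 for i in range(0, len(cur), 2)]
        coeffs = details + coeffs   # coarser details go in front
        cur = avgs
    return cur + coeffs  # [overall average, details coarse -> fine]

def haar_reconstruct(coeffs):
    """Invert the decomposition: pair (avg, detail) -> (avg+d, avg-d)."""
    cur = coeffs[:1]
    rest = coeffs[1:]
    while rest:
        details, rest = rest[:len(cur)], rest[len(cur):]
        nxt = []
        for a, d in zip(cur, details):
            nxt.extend([a + d, a - d])
        cur = nxt
    return cur

data = [2, 2, 0, 2, 3, 5, 4, 4]
c = haar_decompose(data)   # [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
```

A synopsis keeps only the largest-magnitude coefficients and zeros the rest; reconstruction error then depends on the dropped coefficients, which is the quantity the paper bounds.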
- Research Article
1
- 10.1007/s00778-006-0032-z
- Sep 29, 2006
- The VLDB Journal
On-line analytical processing (OLAP) has become an important component in most data warehouse systems and decision support systems in recent years. In order to deal with the huge amount of data, highly complex queries and increasingly strict response time requirements, approximate query processing has been deemed a viable solution. Most works in this area, however, focus on space efficiency and are unable to provide quality-guaranteed answers to queries. To remedy this, in this paper, we propose an efficient framework of DCT for dAta With error estimatioN, called DAWN, which focuses on answering range-sum queries from compressed OP-cubes transformed by DCT. Specifically, utilizing geometric series and Euler's formula, we devise a robust summation function, called the GE function, to answer range queries in constant time, regardless of the number of data cells involved. Note that the GE function can estimate the summation of cosine functions precisely; thus the quality of the answers is superior to that of previous works. Furthermore, an estimator of errors based on the Brown noise assumption (BNA) is devised to provide tight bounds for answering range-sum queries. Our experimental results show that the DAWN framework is scalable in the selectivity of queries and the available storage space. With GE functions and the BNA method, the DAWN framework not only delivers high quality answers for range-sum queries, but also leads to shorter query response time due to its effectiveness in error estimation.
- Conference Article
6
- 10.1145/3221269.3223033
- Jul 9, 2018
Efficient array storage is the backbone of scientific data processing. With an explosion of data, rapidly answering queries on array data is becoming increasingly important. Although most array storage systems today support efficient dimension-based subsetting of an array, they fall back to a full scan when executing value-based filter operations. This has led to interest in approximate query processing, but such methods can have substantial inaccuracies. This paper presents COMPASS, an array storage system with integrated value-index support. Our approach efficiently encodes arrays as bin-based indices and corresponding residuals describing the elements in each bin. Our query processing method uses the bin-based indices, with residuals decompressed as needed, to ensure that accuracy is not sacrificed. Our evaluation shows that, compared with current array storage systems such as SciDB, our method achieves a smaller storage footprint and, most importantly, can perform filter operations an order of magnitude faster on low-selectivity queries, while maintaining comparable performance on high-selectivity queries and dimension-based subsetting operations.
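A toy sketch of the bins-plus-residuals idea (the layout and names are our illustrative assumptions, not COMPASS's actual on-disk format): bins whose range lies entirely past the filter threshold are taken wholesale, and only the straddling bin consults exact values, so no accuracy is lost:

```python
def build_bin_index(values, edges):
    """Bin-based value index: per bin, the positions of its elements.
    Bin i covers the half-open range [edges[i], edges[i+1])."""
    bins = [[] for _ in range(len(edges) - 1)]
    for pos, v in enumerate(values):
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                bins[i].append(pos)
                break
    return bins

def filter_ge(values, edges, bins, threshold):
    """Positions with value >= threshold. Whole bins above the threshold
    qualify outright; only the straddling bin checks exact values
    (the 'residuals')."""
    result = []
    for i, members in enumerate(bins):
        if edges[i] >= threshold:          # entire bin qualifies
            result.extend(members)
        elif edges[i + 1] > threshold:     # straddling bin: exact check
            result.extend(p for p in members if values[p] >= threshold)
    return sorted(result)

values = [5, 12, 7, 30, 18, 3, 25, 9]
edges = [0, 10, 20, 40]
bins = build_bin_index(values, edges)
hits = filter_ge(values, edges, bins, 15)   # positions 3, 4, 6
```

On low-selectivity filters most bins are skipped or taken wholesale, which is where the order-of-magnitude speedup the abstract reports would come from.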
- Book Chapter
8
- 10.1007/bfb0032451
- Mar 23, 1992
Statistical Databases usually allow only statistical queries. In order to answer a query some kind of summarization must be performed on the raw data. If the size of the original data is too large, e.g. as in Census data and the Current Population Survey, obtaining accurate answers is extremely time consuming. Thus, if the application allows for some precision loss in the answer, the mechanism for query answering could take advantage of previously computed summaries to answer other summary queries. In this paper we describe the necessary notions to maintain a database of previously computed summary information to allow fast query answering of new summary queries with a qualified accuracy and without having to go back to the original data. We use the concept of summary tables, study the potential of sets of summary tables for answering queries, and organize these sets in a lattice structure.
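The lattice organization the abstract describes can be sketched with grouping-attribute subsets ordered by inclusion (illustrative only; the paper's formal treatment of summary-table derivability is richer):

```python
from itertools import combinations

def summary_lattice(dims):
    """All grouping-attribute subsets: the lattice of candidate summary
    tables, ordered by set inclusion."""
    return [frozenset(c) for r in range(len(dims) + 1)
            for c in combinations(sorted(dims), r)]

def can_answer(query_dims, stored_summaries):
    """A stored summary grouped on S can answer a query grouped on Q
    iff Q is a subset of S: the finer summary rolls up to the coarser
    grouping without touching the raw data."""
    q = frozenset(query_dims)
    return [s for s in stored_summaries if q <= s]

dims = {"product", "store", "date"}
all_summaries = summary_lattice(dims)          # 2^3 = 8 lattice nodes
usable = can_answer({"store"}, all_summaries)  # summaries containing "store"
```

Answering from a previously computed summary avoids rescanning the raw data, at the cost of the precision loss the paper qualifies.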
- Conference Article
14
- 10.1145/3219819.3219867
- Jul 19, 2018
The ability to identify insights from multi-dimensional big data is important for business intelligence. To enable interactive identification of insights, a large number of dimension combinations need to be searched and a series of aggregation queries need to be quickly answered. The existing approaches answer interactive queries on big data through data cubes or approximate query processing. However, these approaches can hardly satisfy the performance or accuracy requirements for ad-hoc queries demanded by interactive exploration. In this paper, we present BigIN4, a system for instant, interactive identification of insights from multi-dimensional big data. BigIN4 gives insight suggestions by enumerating subspaces and answers queries by combining data cube and approximate query processing techniques. If a query cannot be answered by the cubes directly, BigIN4 decomposes it, via an online-constructed Bayesian network, into several low-dimensional queries that the cubes can answer, and gives an approximate answer within a statistical interval. Unlike related works, BigIN4 does not require any prior knowledge of queries and does not assume a certain data distribution. Our experiments on ten real-world large-scale datasets show that BigIN4 can successfully identify insights from big data. Furthermore, BigIN4 can provide approximate answers to aggregation queries effectively (with less than 10% error on average) and efficiently (50x faster than sampling-based methods).
- Research Article
38
- 10.14778/3407790.3407854
- Jul 1, 2020
- Proceedings of the VLDB Endowment
A private data federation enables clients to query the union of data from multiple data providers without revealing any extra private information to the client or any other data providers. Unfortunately, this strong end-to-end privacy guarantee requires cryptographic protocols that incur a significant performance overhead, as high as 1,000x compared to executing the same query in the clear. As a result, private data federations are impractical for common database workloads. This gap reveals the following key challenge in a private data federation: offering fast and accurate query answers without compromising strong end-to-end privacy. To address this challenge, we propose SAQE, the Secure Approximate Query Evaluator, a private data federation system that scales to very large datasets by combining three techniques, differential privacy, secure computation, and approximate query processing, in a novel and principled way. First, SAQE adds novel secure sampling algorithms into the federation's query processing pipeline to speed up query workloads and to minimize the noise the system must inject into the query results to protect the privacy of the data. Second, we introduce a query planner that jointly optimizes the noise introduced by differential privacy with the sampling rates and the resulting error bounds owing to approximate query processing. Our research shows that these three techniques are synergistic: sampling within certain accuracy bounds improves both query privacy and performance, meaning that SAQE executes over less data than existing techniques without sacrificing efficiency, privacy, or accuracy. Using our optimizer, we leverage this counter-intuitive result to identify an inflection point that maximizes all three criteria prior to query evaluation.
Experimentally, we show that this result enables SAQE to trade-off among these three criteria to scale its query processing to very large datasets with accuracy bounds dependent only on sample size, and not the raw data size.
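The sampling-plus-noise combination SAQE argues for can be sketched for a COUNT query (a strongly simplified, illustrative sketch under our own naming; SAQE's planner, secure sampling, and privacy accounting are far more involved):

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def dp_sampled_count(data, predicate, sample_rate, epsilon, rng):
    """Answer COUNT from a Bernoulli sample, then add Laplace noise
    calibrated to the scaled count's sensitivity (1/sample_rate), since
    one record changes the scaled estimate by at most 1/sample_rate."""
    hits = sum(1 for x in data if rng.random() < sample_rate and predicate(x))
    estimate = hits / sample_rate
    return estimate + laplace_noise((1 / sample_rate) / epsilon, rng)

rng = random.Random(1)
data = range(100_000)
answer = dp_sampled_count(data, lambda x: x < 40_000, 0.1, 1.0, rng)
```

The sketch shows the mechanics only; the paper's point is that the sampling error and the privacy noise can be traded against each other jointly rather than tuned independently.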
- Research Article
3
- 10.1145/3589319
- Jun 13, 2023
- Proceedings of the ACM on Management of Data
Modern analytical engines rely on Approximate Query Processing (AQP) to provide faster response times than the hardware allows for exact query answering. However, existing AQP methods impose steep performance penalties as workload unpredictability increases. Specifically, offline AQP relies on predictable workloads to create, ahead of query execution, samples that match the expected queries, reducing query response times when queries match the expected workload. As soon as workload predictability diminishes, existing online AQP methods create query-specific samples with little reuse across queries, producing significantly smaller gains in response times. As a result, existing approaches cannot fully exploit the benefits of sampling under increased unpredictability. We analyze sample creation and propose LAQy, a framework for building, expanding, and merging samples to adapt to changes in workload predicates. We show the main parameters that affect sample creation time and propose lazy sampling to overcome the unpredictability issues that cause fast-but-specialized samples to be query-specific. We evaluate LAQy by implementing it in an in-memory, code-generation-based, scale-up analytical engine to show the adaptivity and practicality of our framework in a modern system. LAQy speeds up online sampling processing as a function of sample reuse, ranging from practically zero to full online sampling time, and from 2.5x to 19.3x in a simulated exploratory workload.
- Book Chapter
- 10.1007/978-3-030-19807-7_33
- Jan 1, 2019
In today’s big data era, the ability to analyze massive data efficiently and return results within a short time limit is critical to decision making; thus many big data systems have been proposed, and various distributed and parallel processing techniques heavily investigated. Most previous research focuses on precise query processing, while approximate query processing (AQP) techniques, which make interactive data exploration more efficient and let users trade off query accuracy against response time, have not been investigated comprehensively. In this paper, we study the characteristics of aggregate queries, a typical type of analytical query, and propose an approximate query processing approach that optimizes the execution of aggregate queries over massive data using a histogram data structure. We implemented this approach in the big data system Hive and compared it against Hive itself and the AQP-enabled big data system BlinkDB; the experimental results verify that our approach is significantly faster than these existing systems in most scenarios.
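A histogram-backed aggregate estimate of the kind the abstract describes can be sketched as follows (the equi-width layout and the uniform-within-bucket assumption are our illustrative choices, not necessarily the paper's):

```python
def build_histogram(values, num_buckets, lo, hi):
    """Equi-width histogram: per-bucket row count and value sum."""
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    sums = [0.0] * num_buckets
    for v in values:
        b = min(int((v - lo) / width), num_buckets - 1)
        counts[b] += 1
        sums[b] += v
    return counts, sums, width, lo

def approx_range_sum(hist, a, b):
    """Approximate SUM(v) for a <= v < b, attributing each bucket's sum
    proportionally to its overlap with the query range (the usual
    uniform-within-bucket assumption)."""
    counts, sums, width, lo = hist
    total = 0.0
    for i, s in enumerate(sums):
        b_lo, b_hi = lo + i * width, lo + (i + 1) * width
        overlap = max(0.0, min(b, b_hi) - max(a, b_lo))
        total += s * (overlap / width)
    return total

values = [float(i % 100) for i in range(10_000)]
hist = build_histogram(values, 10, 0.0, 100.0)
estimate = approx_range_sum(hist, 20.0, 40.0)
true = sum(v for v in values if 20.0 <= v < 40.0)   # 59_000.0
```

The histogram is tiny compared to the data, so the aggregate is answered by a pass over buckets rather than rows, which is the source of the speedup over a full scan.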
- Conference Article
2
- 10.1109/icmlc.2006.258700
- Jan 1, 2006
Approximate query processing has emerged as a cost-effective approach for dealing with large database systems. Recent work has demonstrated the effectiveness of the wavelet transform in reducing large amounts of data to a compact set of wavelet coefficients. In this paper, we extend related work on approximate query processing using wavelets. We present two algorithms for performing union and difference operations directly in the wavelet domain, and we propose an algorithm for directly updating wavelet coefficients when the original database changes. The experimental results show that wavelets are more accurate than random sampling for union and difference operations, and that when the amount of updated data is not too large, direct wavelet updates are almost as good as optimally selected wavelet synopses.
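Operating directly in the wavelet domain rests on the linearity of the Haar transform: the coefficients of a combined dataset are the coefficient-wise combination. A minimal sketch of that property (a standard fact, not the paper's exact algorithms):

```python
def haar(data):
    """Full Haar decomposition of a power-of-two-length vector:
    [overall average, then detail coefficients coarse to fine]."""
    out, cur = [], list(data)
    while len(cur) > 1:
        out = [(cur[i] - cur[i + 1]) / 2 for i in range(0, len(cur), 2)] + out
        cur = [(cur[i] + cur[i + 1]) / 2 for i in range(0, len(cur), 2)]
    return cur + out

a = [4, 2, 8, 6]
b = [1, 3, 5, 7]
union = [x + y for x, y in zip(a, b)]
# Linearity: combining coefficient vectors equals transforming the
# combined data, so union/difference never needs to reconstruct the data.
combined_coeffs = [x + y for x, y in zip(haar(a), haar(b))]
```

The same linearity underlies direct updates: a single point change perturbs only the O(log N) coefficients on its root-to-leaf path.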
- Research Article
4
- 10.14778/3538598.3538606
- May 1, 2022
- Proceedings of the VLDB Endowment
There has been an increasing demand for real-time data analytics. Approximate Query Processing (AQP) is a popular option because it can use random sampling to trade some accuracy for lower query latency. However, state-of-the-art AQP systems either rely on scan-based sampling algorithms to draw samples, which can still incur a non-trivial table-scan cost, or create samples of the database in a preprocessing step, which are hard to update. The alternative is to use aggregate B-tree indexes to support both random sampling and updates in a database in logarithmic time. However, to the best of our knowledge, it was unknown how to design an aggregate B-tree that supports highly concurrent random sampling and updates, due to the difficulty of maintaining the aggregate weights correctly and efficiently under concurrency. In this work, we identify the key challenges in achieving high concurrency and present AB-tree, an index for highly concurrent random sampling and update operations. We also conduct extensive experiments to show its efficiency and efficacy on a variety of workloads.
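The logarithmic-time sampling an aggregate tree enables can be sketched with an array-backed binary tree of subtree weights (a single-threaded, binary stand-in for the aggregate B-tree; the paper's contribution is precisely making such a structure safe under high concurrency):

```python
import random

class WeightedTree:
    """Array-backed aggregate tree: each internal node stores its subtree's
    total weight, giving O(log n) weighted sampling and weight updates."""

    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (2 * self.n)
        for i, w in enumerate(weights):
            self.tree[self.n + i] = float(w)       # leaves
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def update(self, leaf, weight):
        """Set one leaf's weight and fix aggregates up to the root."""
        i = self.n + leaf
        self.tree[i] = float(weight)
        i //= 2
        while i >= 1:
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def sample(self, rng):
        """Descend from the root, picking each child with probability
        proportional to its subtree weight; returns a leaf index."""
        r = rng.random() * self.tree[1]
        i = 1
        while i < self.n:
            if r < self.tree[2 * i]:
                i = 2 * i
            else:
                r -= self.tree[2 * i]
                i = 2 * i + 1
        return i - self.n

rng = random.Random(0)
t = WeightedTree([1.0, 2.0, 3.0, 4.0])
counts = [0, 0, 0, 0]
for _ in range(10_000):
    counts[t.sample(rng)] += 1   # frequencies roughly 1:2:3:4
```

The concurrency difficulty the abstract names is visible in `update`: a writer touches every aggregate on a root-to-leaf path, and a concurrent `sample` reading those same aggregates mid-update could see inconsistent weights.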