Online aggregation (OLA) reduces cost by returning approximate early answers that the user finds acceptable. Computing an approximate result is more cost-effective than computing the precise one, especially for large-scale datasets. The user can terminate processing at any time, once satisfied with the quality of the result. The performance of OLA depends on the sampling approach and the estimation model, but realizing OLA efficiently in a large-scale distributed computing environment remains a challenging problem. In this paper, we consider the problem of providing OLA in a distributed computing environment and propose a Hadoop-based iterative sampling method for online aggregation. The precision desired by the user can be met with two sampling iterations. To avoid the effects of data bias, we propose a “layered sampling” method that ensures the approximate aggregation result is statistically meaningful. The experimental results show that the “layered sampling” method takes into account not only time efficiency but also the usage of Hadoop's computing and storage resources.
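As a rough illustration of how a two-iteration sampling scheme might reach a user-specified precision, the following Python sketch estimates a mean from in-memory lists. It is not the paper's Hadoop implementation. It assumes that “layered sampling” behaves like classical stratified sampling, that each list stands in for one data partition, and that precision is expressed as the half-width of an approximate 95% confidence interval. All function and parameter names (`sample_strata`, `estimate_mean`, `two_pass_mean`, `pilot_size`) are hypothetical.

```python
import math
import random
import statistics

Z_95 = 1.96  # normal quantile for an approximate 95% confidence interval


def sample_strata(strata, sizes, rng):
    """Draw a simple random sample (without replacement) from each stratum."""
    return [rng.sample(s, min(max(n, 2), len(s))) for s, n in zip(strata, sizes)]


def estimate_mean(strata, samples):
    """Return (estimate, half_width) for the population mean."""
    total = sum(len(s) for s in strata)
    mean, var = 0.0, 0.0
    for stratum, sample in zip(strata, samples):
        weight = len(stratum) / total          # stratum share of the population
        n = len(sample)
        fpc = 1.0 - n / len(stratum)           # finite-population correction
        mean += weight * statistics.fmean(sample)
        var += weight ** 2 * statistics.variance(sample) / n * fpc
    return mean, Z_95 * math.sqrt(var)


def two_pass_mean(strata, target_half_width, pilot_size=30, seed=0):
    """Assumed two-iteration scheme: pilot sample, then a sized second sample."""
    rng = random.Random(seed)
    total = sum(len(s) for s in strata)

    # Iteration 1: a small pilot sample per stratum gives rough variances.
    pilot = sample_strata(strata, [pilot_size] * len(strata), rng)
    estimate, half_width = estimate_mean(strata, pilot)
    if half_width <= target_half_width:
        return estimate, half_width

    # Iteration 2: choose the total sample size from the pilot variances
    # (proportional allocation), then resample and re-estimate.
    pooled_var = sum(
        len(s) / total * statistics.variance(p) for s, p in zip(strata, pilot)
    )
    needed = math.ceil(Z_95 ** 2 * pooled_var / target_half_width ** 2)
    sizes = [math.ceil(needed * len(s) / total) for s in strata]
    final = sample_strata(strata, sizes, rng)
    return estimate_mean(strata, final)


if __name__ == "__main__":
    data_rng = random.Random(42)
    # Three synthetic partitions with very different value distributions.
    strata = [
        [data_rng.gauss(10, 1) for _ in range(5000)],
        [data_rng.gauss(50, 5) for _ in range(3000)],
        [data_rng.gauss(200, 20) for _ in range(2000)],
    ]
    estimate, half_width = two_pass_mean(strata, target_half_width=1.0)
    print(f"estimate = {estimate:.2f} +/- {half_width:.2f}")
```

Because the second sample size is derived from noisy pilot variances, the final half-width only approximately matches the target. A production system would likely iterate further or inflate the sample size when the target is missed.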