Abstract

Online aggregation (OLA) saves cost by returning approximate early answers of acceptable quality. Computing an approximate result is more cost-effective than computing the precise one, especially for large-scale datasets. The user can terminate processing at any time, as soon as they are satisfied with the quality of the result. The performance of OLA depends on the sampling approach and the estimation model, and realizing OLA efficiently in a large-scale distributed computing environment remains a challenging problem. In this paper, we consider the problem of providing OLA in a distributed computing environment and propose a Hadoop-based iterative sampling method for online aggregation. The precision desired by the user can be met with two sampling iterations. To avoid the effects of data bias, we propose a “layered sampling” method that ensures the approximate aggregation result is statistically meaningful. Experimental results show that the “layered sampling” method accounts not only for time efficiency but also for the usage of Hadoop's computing and storage resources.
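To make the idea of meeting a target precision in two sampling iterations concrete, the following is a minimal single-machine sketch in Python. It is not the paper's Hadoop implementation. It assumes that "layered sampling" behaves like textbook stratified sampling, that the aggregate is an average, and that precision is expressed as a confidence-interval half-width under a normal approximation. The function names (`stratified_mean`, `two_phase_estimate`) and parameters (`pilot_n`, `half_width`, `z`) are hypothetical and chosen only for illustration.

```python
import math
import random
import statistics


def stratified_mean(samples, sizes):
    """Return (mean estimate, standard error) from one sample per layer.

    samples: list of lists, one sample per layer (each with >= 2 values)
    sizes:   population size of each layer
    """
    total = sum(sizes)
    mean = 0.0
    var = 0.0
    for sample, layer_size in zip(samples, sizes):
        weight = layer_size / total
        mean += weight * statistics.fmean(sample)
        # Variance of the layer mean; finite population correction is
        # ignored here, which makes the interval slightly conservative.
        var += weight ** 2 * statistics.variance(sample) / len(sample)
    return mean, math.sqrt(var)


def two_phase_estimate(strata, pilot_n, half_width, z=1.96, seed=0):
    """Estimate the overall mean in at most two sampling iterations.

    Returns (estimate, achieved half-width, rows used by the final estimate).
    """
    rng = random.Random(seed)
    sizes = [len(layer) for layer in strata]

    # Iteration 1: small pilot sample from every layer.
    pilot = [rng.sample(layer, min(pilot_n, len(layer))) for layer in strata]
    mean, se = stratified_mean(pilot, sizes)
    if z * se <= half_width:
        return mean, z * se, sum(len(p) for p in pilot)

    # Iteration 2: the half-width shrinks with 1/sqrt(n), so scale each
    # layer's sample size by the squared ratio of achieved to desired width.
    factor = (z * se / half_width) ** 2
    final = [
        rng.sample(layer, min(len(layer), math.ceil(len(p) * factor)))
        for layer, p in zip(strata, pilot)
    ]
    mean, se = stratified_mean(final, sizes)
    return mean, z * se, sum(len(f) for f in final)


if __name__ == "__main__":
    # Synthetic, deliberately skewed data: three layers with very
    # different value ranges.
    data = [
        [10 + (i % 7) for i in range(5000)],
        [100 + (i % 11) for i in range(5000)],
        [1000 + (i % 13) for i in range(5000)],
    ]
    estimate, margin, used = two_phase_estimate(data, pilot_n=30, half_width=0.5)
    print(f"estimate={estimate:.2f} +/- {margin:.2f} using {used} rows")
```

Because the second iteration sizes its sample from a variance that was itself estimated from the pilot, the achieved half-width can land slightly above the target. The second sample is also drawn fresh rather than extending the pilot, which keeps the sketch short at the cost of discarding the pilot rows.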

