Abstract

In this paper, we present an I/O-efficient distributed approximation framework that supports approximations on arbitrary sub-datasets of a large dataset. Because caching offline samples for every sub-dataset incurs prohibitive storage overhead, existing offline sample-based systems provide high-accuracy results for only a limited number of sub-datasets, such as the popular ones. Current online sample-based approximation systems, which generate samples at runtime, do not take into account the uneven storage distribution of a sub-dataset. They work well when a sub-dataset is uniformly distributed but suffer from low I/O efficiency and poor estimation accuracy on unevenly distributed sub-datasets. To address this problem, we develop a distribution-aware method called CLAP (cluster sampling based approximation). Our idea is to collect the occurrences of a sub-dataset at each logical partition of a dataset (its storage distribution) in the distributed system and to use this information to enable I/O-efficient online sampling. There are five thrusts in CLAP. First, we develop a probabilistic map to reduce the exponential number of recorded sub-datasets to a linear one. Second, we apply the theory of cluster sampling with unequal probabilities to implement a distribution-aware method for efficient online sampling of a single sub-dataset or multiple sub-datasets. Third, we extend CLAP beyond simple aggregation approximations to support more complex approximations, such as ratio and regression, using bootstrap-based estimation. Fourth, we add an option in CLAP that allows users to specify a target error bound when submitting an approximation job. Fifth, we quantitatively derive the optimal sampling unit size in a distributed file system by associating it with approximation costs and accuracy. We have implemented CLAP in Hadoop as an example system and open-sourced it on GitHub.
Our comprehensive experimental results show that CLAP can achieve a speedup of up to 20× over precise execution.
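
To make the idea of distribution-aware cluster sampling concrete, the following is a minimal sketch rather than CLAP's actual implementation. It assumes that each logical partition is treated as one cluster, that the recorded occurrence counts of the sub-dataset are available as a plain list, and that the textbook Hansen-Hurwitz estimator (sampling with replacement, with probability proportional to the recorded count) is an acceptable stand-in for the estimator used in the paper. The function name `pps_cluster_estimate` and its parameters are hypothetical.

```python
import random


def pps_cluster_estimate(partitions, counts, n_draws, seed=0):
    """Estimate the total of a sub-dataset's values across all partitions.

    partitions: list of lists; partitions[i] holds the sub-dataset's values
                stored in partition i (read only if partition i is drawn).
    counts:     recorded number of sub-dataset occurrences per partition
                (the "storage distribution"; assumed known in advance).
    n_draws:    number of partitions to draw, with replacement.
    """
    total_count = sum(counts)
    if total_count == 0:
        return 0.0  # the sub-dataset does not occur anywhere

    # Selection probability proportional to the recorded occurrences, so
    # partitions that hold none of the sub-dataset are never read.
    probs = [c / total_count for c in counts]

    rng = random.Random(seed)
    drawn = rng.choices(range(len(partitions)), weights=probs, k=n_draws)

    # Hansen-Hurwitz estimator: average of (cluster total / selection prob.).
    return sum(sum(partitions[i]) / probs[i] for i in drawn) / n_draws
```

In this sketch, a partition whose recorded count is zero has selection probability zero, which is how the storage distribution avoids wasted I/O. When the per-partition totals are roughly proportional to the recorded counts, every draw yields nearly the same value, so few draws are needed for a tight estimate.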
