Abstract
>Sampling is one of the most widely employed approximations in big data processing. Among various challenges in sampling design, sampling for join is particularly intriguing yet complex. This perplexing problem starts with a classical case where the join of two Bernoulli samples shrinks its output size quadratically and exhibits a strong dependency on the input data, presenting a unique challenge that necessitates adaptive sampling to guarantee both the quantity and quality of the sampled data. The community has made strides in achieving this goal by constructing offline samples and integrating support from indexes or key frequencies. However, when dealing with stream data, due to the need for real-time processing and high-quality analysis, methods developed for processing static data become unavailable. Consequently, a fundamental question arises: Is it possible to achieve adaptive sampling in stream data without relying on offline techniques? To address this problem, we propose FreeSam, which couples hybrid sampling with intra-window join, a key stream join operator. Our focus lies on two widely used metrics: output size, ensuring quantity, and variance, ensuring quality. FreeSam enables adaptability in both the desired quantity and quality of data sampling by offering control on the two-dimensional space spanned by these metrics. Meanwhile, adjustable trade-offs between quality and performance make FreeSam practical for use. Our experiments show that, for every 1% increase in latency limitation, FreeSam can yield a 3.83% increase in the output size while maintaining the level of the estimator's variance. Additionally, we give FreeSam a multi-core implementation and ensure predictability of its latency through both an analytic model and a neural network model. The accuracy of these models is 88.05% and 96.75% respectively.
Published Version
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.