Sampling is one of the most widely used approximation techniques in big data processing. Among the many challenges in sampling design, sampling for joins is particularly intriguing and complex. The difficulty begins with a classical case: the join of two Bernoulli samples shrinks the output size quadratically and depends strongly on the input data, so adaptive sampling is needed to guarantee both the quantity and the quality of the sampled data. The community has made progress toward this goal by constructing offline samples and by relying on indexes or key frequencies. For stream data, however, the need for real-time processing and high-quality analysis makes methods developed for static data inapplicable. A fundamental question therefore arises: is it possible to achieve adaptive sampling on stream data without relying on offline techniques? To address this problem, we propose FreeSam, which couples hybrid sampling with the intra-window join, a key stream join operator. We focus on two widely used metrics: output size, which captures quantity, and variance, which captures quality. FreeSam adapts to both the desired quantity and the desired quality of the sampled data by offering control over the two-dimensional space spanned by these metrics. In addition, adjustable trade-offs between quality and performance make FreeSam practical to use. Our experiments show that, for every 1% increase in the latency limit, FreeSam yields a 3.83% increase in output size while maintaining the level of the estimator's variance. We also provide a multi-core implementation of FreeSam and make its latency predictable through both an analytic model and a neural network model, whose accuracies are 88.05% and 96.75%, respectively.
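To make the classical case concrete, the following minimal sketch illustrates why joining two Bernoulli samples shrinks the output quadratically: if each input is sampled independently at rate p, each joined pair survives with probability p², so the sampled join is expected to hold about p² of the full result. This is not FreeSam and not the authors' code; the function names (`bernoulli_sample`, `join_size`, `estimate_join_size`) and the toy key lists are assumptions made purely for illustration.

```python
import random
from collections import Counter


def bernoulli_sample(rows, p, rng):
    """Keep each row independently with probability p."""
    return [r for r in rows if rng.random() < p]


def join_size(left_keys, right_keys):
    """Number of output tuples of an equi-join over two lists of join keys."""
    right_counts = Counter(right_keys)
    return sum(right_counts[k] for k in left_keys)


def estimate_join_size(left_keys, right_keys, p, rng):
    """Join two Bernoulli samples and scale the result by 1 / p^2.

    Returns (sampled_join_size, estimated_full_join_size).
    """
    s_left = bernoulli_sample(left_keys, p, rng)
    s_right = bernoulli_sample(right_keys, p, rng)
    sampled = join_size(s_left, s_right)
    return sampled, sampled / (p * p)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy inputs: 1000 tuples per side over 10 distinct keys.
    left = [i % 10 for i in range(1000)]
    right = [i % 10 for i in range(1000)]
    full = join_size(left, right)
    for p in (0.5, 0.1):
        sampled, estimate = estimate_join_size(left, right, p, rng)
        print(f"p={p}: full={full}, sampled={sampled}, "
              f"expected about {p * p * full:.0f}, estimate={estimate:.0f}")
```

At a sampling rate of 10% per side, only about 1% of the join output is expected to survive, and how much actually survives depends on how the keys are distributed. This dependency on the input data is the motivation the abstract gives for adaptive sampling.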