Abstract

Stream-based join algorithms are a promising technology for modern real-time data warehouses. A particular category of stream-based joins is the semi-stream join, in which a single stream is joined with disk-based master data. The join operator typically works with limited main memory, which is generally not large enough to hold all of the disk-based master data. A seminal join algorithm called MESHJOIN (Mesh Join) was recently proposed in the literature to process semi-stream data. MESHJOIN is a candidate for a resource-aware system setup. However, MESHJOIN is not very selective: it does not consider the characteristics of the stream data, and its performance is suboptimal for skewed stream data. In this paper I propose a novel Cached-based Semi-Stream Join (CSSJ) that uses a cache module. The algorithm is better suited to skewed distributions, and I present results for Zipfian distributions of the type that appear in many applications. I conduct a rigorous experimental study to test the algorithm. The experiments show that CSSJ outperforms MESHJOIN significantly. I also present a cost model for CSSJ and validate it with experiments.
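
To illustrate why a cache can help on skewed streams, the following is a minimal sketch, not the CSSJ algorithm itself, whose details the abstract does not give. It assumes an in-memory dictionary standing in for the disk-based master data, a cache filled up front with the most frequent stream keys (a simplification; a real operator would have to detect frequent keys online), and hypothetical helper names `zipf_keys` and `cached_join`.

```python
import random
from collections import Counter


def zipf_keys(n_keys, n_tuples, exponent=1.0, seed=0):
    """Draw stream join keys from a Zipfian distribution over 1..n_keys."""
    rng = random.Random(seed)
    # Weight of the key at a given rank is proportional to 1 / rank^exponent.
    weights = [1.0 / (rank ** exponent) for rank in range(1, n_keys + 1)]
    return rng.choices(range(1, n_keys + 1), weights=weights, k=n_tuples)


def cached_join(stream_keys, master, cache_size):
    """Join stream keys with master rows, probing a small cache first.

    Assumption: the cache is pre-filled with the rows of the most frequent
    keys in the stream. Tuples that miss the cache are deferred, standing in
    for the part of the stream a disk-based join phase would have to handle.
    """
    hot_keys = [k for k, _ in Counter(stream_keys).most_common(cache_size)]
    cache = {k: master[k] for k in hot_keys}

    output, deferred, cache_hits = [], [], 0
    for key in stream_keys:
        if key in cache:
            output.append((key, cache[key]))  # served from memory
            cache_hits += 1
        else:
            deferred.append(key)  # would wait for the disk-based phase

    # Simulated disk-based phase: a plain lookup in the full master data.
    for key in deferred:
        output.append((key, master[key]))
    return output, cache_hits


if __name__ == "__main__":
    master = {k: f"row-{k}" for k in range(1, 1001)}
    keys = zipf_keys(n_keys=1000, n_tuples=10_000)
    joined, hits = cached_join(keys, master, cache_size=50)
    print(f"cache served {hits / len(keys):.0%} of stream tuples")
```

In this sketch the cache holds only 5% of the master keys, yet under a Zipfian key distribution it serves a much larger share of the stream tuples, which is the intuition behind adding a cache module for skewed data.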
