Abstract

Inequality join is an operator that joins data on inequality conditions, and it is a fundamental building block for applications. While methods and optimizations exist for efficient inequality joins in batch processing, little attention has been given to the streaming version, particularly for large-scale, data-intensive applications that run on Distributed Stream Processing Systems (DSPSs). Designing an inequality join for streaming and distributed settings is not an easy task: (i) indexes have to be employed to efficiently support inequality-based comparisons, but the continuous stream of data imposes continuous insertions, updates, and deletions of elements in the indexes, which incurs a huge overhead for the DSPS; (ii) real data is often skewed, which makes indexing even more challenging. To address these challenges, we propose the Stream-Aware inequality join (STA), an indexing method that reduces redundancy and index update overhead. STA builds a separate in-memory index structure for hot keys, i.e., the most frequently used keys, which are automatically identified with an efficient data sketch. Cold keys, in contrast, are handled using a linked set of index structures. In this way, STA avoids many superfluous index updates for frequent items. Finally, we implement four state-of-the-art inequality join solutions for a widely employed DSPS (Apache Storm) and compare their performance with STA on four real-world data sets and a synthetic one. The results of our experimental evaluation reveal that our stream-aware approach outperforms existing solutions.
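
The hot/cold separation described in the abstract can be illustrated with a minimal Python sketch. It is not the STA implementation: the abstract does not name the sketch or the index layout, so the Count-Min sketch, the `hot_threshold` parameter, the class names, and the single sorted list standing in for the "linked set of index structures" are all assumptions made for illustration. The point it shows is that a frequent key touches the sorted structure only a few times, after which its tuples are appended to a per-key bucket.

```python
import bisect
import hashlib
from collections import defaultdict


class CountMinSketch:
    """Approximate frequency counter (assumed choice of data sketch).

    Estimates can be too high because of hash collisions, never too low.
    """

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _slots(self, key):
        # One independent hash per row, derived by salting BLAKE2b.
        for row in range(self.depth):
            digest = hashlib.blake2b(repr(key).encode(), salt=bytes([row])).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key):
        """Count one occurrence of key and return the updated estimate."""
        estimate = None
        for row, col in self._slots(key):
            self.table[row][col] += 1
            value = self.table[row][col]
            estimate = value if estimate is None else min(estimate, value)
        return estimate


class HotColdIndex:
    """Hypothetical index that stores hot and cold keys separately."""

    def __init__(self, hot_threshold=3):
        self.sketch = CountMinSketch()
        self.hot_threshold = hot_threshold  # assumed tuning parameter
        self.hot = defaultdict(list)        # hot key -> payloads (append only)
        self.cold_keys = []                 # sorted, one entry per cold tuple
        self.cold_payloads = []             # parallel to cold_keys

    def insert(self, key, payload):
        estimate = self.sketch.add(key)
        if key in self.hot or estimate >= self.hot_threshold:
            # Hot path: no update of the sorted structure.
            self.hot[key].append(payload)
        else:
            # Cold path: ordered insert. Tuples stored here before a key
            # turned hot simply stay in the cold index.
            pos = bisect.bisect_right(self.cold_keys, key)
            self.cold_keys.insert(pos, key)
            self.cold_payloads.insert(pos, payload)

    def probe_less_than(self, bound):
        """Return payloads of all stored tuples whose key is < bound."""
        end = bisect.bisect_left(self.cold_keys, bound)
        result = list(self.cold_payloads[:end])
        for key, payloads in self.hot.items():
            if key < bound:
                result.extend(payloads)
        return result


# Usage: key 5 is skewed and becomes hot; keys 1 and 9 are rare.
index = HotColdIndex(hot_threshold=3)
for i in range(5):
    index.insert(5, f"r5-{i}")
index.insert(1, "r1")
index.insert(9, "r9")
matches = index.probe_less_than(6)  # stored tuples joining with s.key = 6 on r.key < s.key
```

A real streaming join would also need window-based eviction and an ordered structure over the hot keys so that probes do not scan every hot bucket; both are omitted here for brevity.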
