Abstract

In the era of data deluge, data analysis has become a key task for many industrial applications, such as master data management and data integration. In particular, the similarity join, which finds similar pairs according to a similarity function and a threshold, is an important primitive operator for data analysis. In this article, we first propose a new similarity join operation, the dynamic skyline join, which requires neither a similarity function nor a similarity threshold to be specified and instead measures similarity through multicriteria optimization. The dynamic skyline join operator makes the similarity join more flexible, allowing it to support different criteria in multidimensional space. However, dynamic skyline joins are nontrivial to compute, because both join operations and dynamic skyline queries are computationally expensive, and increasingly so as the volume of real-world data grows. Therefore, we further propose Grid-SkyJoin, a framework that enables efficient parallel dynamic skyline joins on a shared-nothing cluster. Specifically, we use grid partitioning to facilitate data filtering, together with grouping strategies that provide load balancing and reduce the number of replicas. We also propose a multilevel filtering scheme that prunes a large fraction of unpromising points, i.e., points that cannot appear in the join result, without performing actual join operations. Extensive experiments on benchmark datasets demonstrate that our filtering scheme greatly reduces the number of data points to be joined, and that our approach is on average about twice as fast as the straightforward method.
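
To make the operator concrete, the following is a minimal single-machine sketch of a naive dynamic skyline join. It assumes the standard definition of dynamic dominance (a point p dominates p' with respect to a query point q if p is at least as close to q in every dimension and strictly closer in at least one) and assumes that the join pairs each point r of one dataset with the dynamic skyline of the other dataset with respect to r. The abstract does not spell out the exact join semantics, and the function names here are hypothetical; this is not the Grid-SkyJoin framework, only a plausible version of the kind of straightforward baseline it is compared against.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dynamically_dominates(p: Point, other: Point, q: Point) -> bool:
    """Return True if p dominates `other` with respect to query point q.

    Assumed definition: p is no farther from q than `other` in every
    dimension, and strictly closer in at least one dimension.
    """
    dist_p = [abs(a - b) for a, b in zip(p, q)]
    dist_o = [abs(a - b) for a, b in zip(other, q)]
    no_worse = all(x <= y for x, y in zip(dist_p, dist_o))
    strictly_better = any(x < y for x, y in zip(dist_p, dist_o))
    return no_worse and strictly_better


def dynamic_skyline(q: Point, points: Sequence[Point]) -> List[Point]:
    """Points in `points` that no other point dominates with respect to q."""
    return [
        p for p in points
        if not any(dynamically_dominates(o, p, q) for o in points)
    ]


def naive_dynamic_skyline_join(
    r_points: Sequence[Point], s_points: Sequence[Point]
) -> List[Tuple[Point, Point]]:
    """Assumed join semantics: pair each r with its dynamic skyline over S.

    Quadratic in |S| for every r, with no partitioning or filtering.
    """
    return [(r, s) for r in r_points for s in dynamic_skyline(r, s_points)]
```

For example, with R = [(0, 0)] and S = [(1, 1), (2, 2), (-1, 3), (3, 0)], the point (1, 1) dominates both (2, 2) and (-1, 3) with respect to (0, 0), while (3, 0) is closer in the second dimension and therefore survives. No similarity function or threshold is needed to decide which pairs are returned.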
