Abstract
Given two object sets Q and O, a metric similarity join finds the object pairs that are similar according to a given criterion. This operator has a wide range of applications, including data cleaning and data mining. In this paper, we employ a popular distributed framework, MapReduce, to support scalable metric similarity joins. To ensure load balancing, we present two sampling-based partition methods: a clustering-based partition method and a KD-tree-based partition method. To avoid unnecessary object pair evaluations, we propose a framework that maps the two involved object sets in order, where plane sweeping and pivot-based filtering techniques are utilized for pruning. Extensive experiments confirm that our solution significantly outperforms existing state-of-the-art competitors.
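The pruning idea mentioned above can be illustrated on a single machine. The following is a minimal sketch, not the paper's MapReduce algorithm: it assumes a distance-threshold join (pairs with distance at most `eps`), a single pivot, and Euclidean distance, and the names `pivot_sweep_join`, `eps`, and `pivot` are hypothetical. Each object is mapped to its distance from the pivot; by the triangle inequality, |d(q, p) - d(o, p)| <= d(q, o), so any pair whose mapped keys differ by more than `eps` can be skipped without computing d(q, o). Sorting both sets by key lets a sweep visit only the candidates inside that window.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.dist(a, b)


def pivot_sweep_join(Q, O, eps, pivot, dist=euclidean):
    """Return (pairs, evaluated): all (q, o) with dist(q, o) <= eps, plus
    the number of exact distance computations between Q and O objects.

    Illustrative assumptions: one pivot, a threshold join, in-memory data.
    """
    # Map each object to a 1-D key: its distance to the pivot.
    q_keyed = sorted(((dist(q, pivot), q) for q in Q), key=lambda t: t[0])
    o_keyed = sorted(((dist(o, pivot), o) for o in O), key=lambda t: t[0])

    pairs = []
    evaluated = 0
    start = 0  # left edge of the sweep window; only ever moves right
    for kq, q in q_keyed:
        # Objects with key < kq - eps cannot match q, nor any later q.
        while start < len(o_keyed) and o_keyed[start][0] < kq - eps:
            start += 1
        j = start
        # Objects with key > kq + eps are pruned by the triangle inequality.
        while j < len(o_keyed) and o_keyed[j][0] <= kq + eps:
            evaluated += 1
            if dist(q, o_keyed[j][1]) <= eps:
                pairs.append((q, o_keyed[j][1]))
            j += 1
    return pairs, evaluated


if __name__ == "__main__":
    Q = [(0.0, 0.0), (5.0, 5.0)]
    O = [(0.0, 1.0), (5.0, 6.0), (10.0, 10.0)]
    pairs, evaluated = pivot_sweep_join(Q, O, eps=1.5, pivot=(0.0, 0.0))
    print(pairs, evaluated)
```

On this toy input the sweep computes two exact distances instead of the six a nested loop would need. How much is pruned in practice depends on the pivot choice and the data distribution, which this sketch does not address.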