Metric Similarity Joins Using MapReduce

Gang Chen, Keyu Yang, Chun Chen, Baihua Zheng, Yunjun Gao, Lu Chen

doi:10.1109/tkde.2016.2631599

Open Access

https://doi.org/10.1109/tkde.2016.2631599

Abstract

Given two object sets Q and O, a metric similarity join finds similar object pairs according to a certain criterion. This operation has a wide variety of applications, including data cleaning and data mining, to name but a few. However, the rapidly growing volume of data challenges traditional metric similarity join methods, and thus a distributed method is required. In this paper, we adopt a popular distributed framework, namely MapReduce, to support scalable metric similarity joins. To ensure load balancing, we present two sampling-based partition methods. One utilizes the pivot and space-filling curve mappings to cluster the data into one-dimensional space, and then selects high-quality centroids to enable equal-sized partitions. The other uses the KD-tree partitioning technique to divide the data equally after the pivot mapping. To avoid unnecessary object pair evaluation, we propose a framework that maps the two involved object sets in order, where the range-object filtering, double-pivot filtering, pivot filtering, and plane sweeping techniques are utilized for pruning. Extensive experiments with both real and synthetic data sets demonstrate that our solutions significantly outperform existing state-of-the-art competitors.
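
The pivot mapping and pivot filtering mentioned in the abstract can be illustrated with a minimal single-machine sketch. It is not the paper's MapReduce algorithm: it only shows the general triangle-inequality idea behind pivot-based pruning in metric spaces. The function names, the distance threshold `eps` as the join criterion, and the choice of Euclidean distance are all assumptions made for illustration.

```python
from itertools import product
from math import dist  # Euclidean distance (Python 3.8+); any metric would do


def pivot_map(obj, pivots, metric):
    """Map an object to the vector of its distances to each pivot."""
    return tuple(metric(obj, p) for p in pivots)


def can_prune(q_vec, o_vec, eps):
    """Pivot filter. By the triangle inequality, |d(q,p) - d(o,p)| <= d(q,o)
    for every pivot p, so a gap larger than eps rules the pair out."""
    return any(abs(a - b) > eps for a, b in zip(q_vec, o_vec))


def similarity_join(Q, O, pivots, eps, metric=dist):
    """Return pairs (q, o) with metric(q, o) <= eps (assumed criterion),
    plus the number of pairs whose exact distance had to be computed."""
    q_vecs = [pivot_map(q, pivots, metric) for q in Q]
    o_vecs = [pivot_map(o, pivots, metric) for o in O]
    result, evaluated = [], 0
    for (q, qv), (o, ov) in product(zip(Q, q_vecs), zip(O, o_vecs)):
        if can_prune(qv, ov, eps):
            continue  # discarded using only the precomputed pivot distances
        evaluated += 1
        if metric(q, o) <= eps:
            result.append((q, o))
    return result, evaluated
```

Calling `similarity_join(Q, O, pivots, eps)` returns the qualifying pairs together with a count of exact distance computations, which can be compared against `len(Q) * len(O)` to see how many pairs the filter discarded.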

Journal: IEEE Transactions on Knowledge and Data Engineering
Publication Date: Dec 26, 2016
Citations: 27
License type: CC BY-NC-ND
