Prefix filtering with data partitioning for similarity join

Methus Bhirakit, Jaruloj Chongstitvatana

doi:10.1109/icsec.2013.6694772

Abstract

Many applications, such as data integration and data preparation, use similarity join as an important operation. In real-world applications, similarity joins become challenging when the data sets are large. Filter-and-verify methods have been proposed to reduce the running time of similarity joins. The prefix filtering algorithm, one of the filter-and-verify methods, filters out some dissimilar strings by examining only the prefixes of the strings instead of the whole strings. In this paper, we propose data partitioning for the prefix filtering method used in similarity joins. In our approach, the database is divided into partitions, and prefix filtering is performed on each partition. The proposed algorithm supports parallelism because filtering can be done on each partition independently. Furthermore, when the data set is partitioned into smaller sets, a proper prefix length can be determined for each partition. This also improves the selection of candidate strings and reduces the verification time. An experiment is performed to compare the proposed algorithm with state-of-the-art algorithms. The experiment shows that our method achieves higher performance by reducing the number of candidate strings and by executing in parallel.
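
To make the filter-and-verify idea concrete, the following is a minimal sketch of baseline prefix filtering only. It does not reproduce the partitioning scheme proposed in the paper, whose details the abstract does not give. It assumes each record is a set of tokens, that similarity is Jaccard, and that tokens are ordered from rarest to most frequent. These choices, and the names `jaccard` and `prefix_filter_join`, are illustrative assumptions rather than the authors' implementation.

```python
import math
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two non-empty token sets."""
    return len(a & b) / len(a | b)


def prefix_filter_join(records, threshold):
    """Self-join of token sets with Jaccard similarity >= threshold.

    Assumption: Jaccard over token sets; the paper's similarity
    function and tokenization may differ.
    Returns (sorted list of matching index pairs, number of candidates).
    """
    # Global token order: rare tokens first, ties broken by token value.
    freq = defaultdict(int)
    for tokens in records:
        for t in tokens:
            freq[t] += 1

    index = defaultdict(list)  # token -> ids of records with it in their prefix
    candidates = set()

    for rid, tokens in enumerate(records):
        ordered = sorted(tokens, key=lambda t: (freq[t], t))
        n = len(ordered)
        # Two sets with Jaccard >= threshold must share a token within
        # their first n - ceil(threshold * n) + 1 tokens in the global order.
        # The small epsilon guards against floating-point round-up.
        prefix_len = n - math.ceil(threshold * n - 1e-9) + 1
        for t in ordered[:prefix_len]:
            for other in index[t]:
                candidates.add((other, rid))  # filter step
            index[t].append(rid)

    # Verify step: compute the exact similarity only for candidates.
    matches = [(i, j) for i, j in sorted(candidates)
               if jaccard(records[i], records[j]) >= threshold]
    return matches, len(candidates)


example_records = [
    {"a", "b", "c", "d"},
    {"a", "b", "c", "e"},
    {"x", "y", "z"},
    {"a", "b", "c", "d"},
]
example_matches, example_candidates = prefix_filter_join(example_records, 0.6)
```

In this sketch, only pairs that share a prefix token reach the verify step, so the candidate count can be much smaller than the number of all pairs. Under the same assumptions, a shorter prefix means fewer index entries and fewer candidates, which is why choosing the prefix length per partition, as the abstract describes, could reduce verification time.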
