Abstract

Entity resolution (ER) is the process of identifying which tuples or records in a dataset refer to the same real-world entity. In this paper, we focus primarily on the deduplication task for n-dimensional datasets of real-valued vectors with dimensionality 2 ≤ n ≤ 784. We present an algorithm with complete and disjoint space partitioning, and use it to build the ER-Index, with a region-tree and sorted lists, from a sample set of a (dirty) dataset. We use primary horizontal fragmentation to partition a dataset into a set of fragments, and then resolve the fragments one by one by applying the ER-Index and the proposed algorithms. Extensive experiments are conducted to evaluate the performance of the proposed approaches.
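
To make the fragment-then-resolve idea concrete, the following is a minimal sketch and not the paper's ER-Index: the region-tree and sorted lists are not reproduced. It assumes a uniform grid as the fragmentation predicate and a naive Euclidean-distance threshold as the duplicate test; the names `fragment`, `resolve_fragment`, `deduplicate`, `cell_size`, and `eps` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from math import dist, floor


def fragment(records, cell_size):
    """Primary horizontal fragmentation (illustrative predicate: grid cell).

    Each record is assigned to exactly one fragment based on its own
    attribute values, so the fragments are complete and disjoint.
    """
    fragments = defaultdict(list)
    for idx, vec in enumerate(records):
        key = tuple(floor(x / cell_size) for x in vec)
        fragments[key].append(idx)
    return fragments


def resolve_fragment(records, indices, eps):
    """Naive deduplication inside one fragment (assumption, not the ER-Index).

    A record joins the first cluster whose representative (its first member)
    lies within distance eps; otherwise it starts a new cluster.
    """
    clusters = []
    for i in indices:
        for cluster in clusters:
            if dist(records[i], records[cluster[0]]) <= eps:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def deduplicate(records, cell_size=1.0, eps=0.05):
    """Resolve the fragments one by one and collect the clusters."""
    fragments = fragment(records, cell_size)
    result = []
    for key in sorted(fragments):
        result.extend(resolve_fragment(records, fragments[key], eps))
    return result


records = [(0.10, 0.10), (0.11, 0.10), (2.50, 2.50), (0.90, 0.20), (2.51, 2.49)]
clusters = deduplicate(records)
print(clusters)
```

Because each fragment is resolved independently, this sketch misses duplicates that fall on opposite sides of a grid boundary; it illustrates only the control flow of partitioning first and resolving per fragment.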
