Abstract
Duplicate records are a major data quality concern in large datasets. Entity matching, an essential step of the data cleaning process, detects duplicates by identifying records that refer to the same real-world entity. Most proposed algorithms require labeled data to train a classifier, but labeled data is not always available. In this paper, we propose an unsupervised approach to the entity matching problem based on an improved version of the genetic algorithm. We explain the main improvements made to the genetic algorithm and the strategy used to encode partitions of the records as chromosomes. Different similarity functions are used to compute the similarities between records. The results show that the proposed approach is effective for entity matching and outperforms the approach based on the traditional genetic algorithm.
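To make the idea of encoding a partition as a chromosome concrete, the following is a minimal illustrative sketch and not the encoding or fitness function used in this work. It assumes a label-based encoding (gene `i` holds the cluster label of record `i`), a token-level Jaccard similarity, and a simple pairwise fitness. The names `jaccard`, `decode`, and `fitness` are hypothetical.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings (assumed measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def decode(chromosome):
    """Turn a label-based chromosome into a partition (list of index groups)."""
    clusters = {}
    for idx, label in enumerate(chromosome):
        clusters.setdefault(label, []).append(idx)
    return list(clusters.values())


def fitness(chromosome, records):
    """Assumed fitness: average over all record pairs of the similarity
    when both records share a cluster, and 1 - similarity otherwise."""
    pairs = list(combinations(range(len(records)), 2))
    if not pairs:
        return 0.0
    total = 0.0
    for i, j in pairs:
        s = jaccard(records[i], records[j])
        total += s if chromosome[i] == chromosome[j] else 1.0 - s
    return total / len(pairs)


# Toy records: the first two describe the same entity.
records = ["john smith london", "smith john london", "mary jones paris"]
good = [0, 0, 1]  # records 0 and 1 grouped together
bad = [0, 1, 0]   # records 0 and 2 grouped together

print(decode(good))
print(fitness(good, records), fitness(bad, records))
```

Under these assumptions, a genetic algorithm would evolve a population of such chromosomes and favor partitions whose clusters group highly similar records, so no labeled pairs are needed.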