Abstract

Entity resolution (ER), an essential task in data integration, identifies the records that refer to the same entity. ER is also known by other names, such as duplicate detection and pattern matching. Nowadays, activities such as information integration, information retrieval, crowdsourcing, and pay-as-you-go involve users in ER tasks, such as deciding whether two entity descriptions refer to the same entity. Previous work on ER relies on clustering and comparison approaches that rest on certain assumptions, and the quality of ER degrades when those assumptions do not hold. We present a new set of entity rules in which each rule enumerates all possibilities for identifying the correct entity of a record. Additionally, we propose an extended approach (GenR) for efficient and effective rule generation that uses a specialized form of a term-based entropy measure. We experimentally evaluated the proposed approach on a data set with a large number of records and on data sets with different data characteristics. We report promising empirical results that demonstrate a performance improvement from using a term-based quality measure.

