A threshold-based similarity measure for duplicate detection

Mohammadreza Ektefa, Fatimah Sidi, Sara Memar, Marzanah A Jabar, Hamidah Ibrahim, Abdullah Ramli

doi:10.1109/icos.2011.6079233

Abstract

Data integration is essential for extracting useful information and recognizing patterns in large volumes of data stored in different databases with different formats. However, data integration may lead to duplication: because the same data are available in different formats, several records may refer to the same entity. Duplicate detection, or record linkage, is a technique for detecting and matching the duplicate records generated during the data integration process. Most approaches have concentrated on string similarity measures for comparing records, but these fail to identify records that share semantic information. This study therefore proposes a threshold-based method that takes both string and semantic similarity measures into account when comparing record pairs. The method is evaluated on a real-world dataset, Restaurant, and its effectiveness is measured with several standard evaluation metrics. The experimental results indicate that the proposed similarity method, which combines string and semantic similarity measures, outperforms the individual similarity measures, with an F-measure of 99.1% on the Restaurant dataset. Semantic similarity should therefore be considered alongside string similarity in order to detect duplicate records more effectively.
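The abstract does not spell out how the two kinds of similarity are combined, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a character-based string score from Python's `difflib`, a toy semantic score based on a hypothetical synonym table (`SYNONYM_GROUPS`), a per-field maximum of the two scores, and an illustrative threshold of 0.8. The F-measure helper uses the standard harmonic mean of precision and recall.

```python
from difflib import SequenceMatcher

# Hypothetical synonym groups standing in for a semantic resource;
# the abstract does not say which resource the authors used.
SYNONYM_GROUPS = [
    {"st", "street"},
    {"ave", "avenue"},
    {"bbq", "barbecue"},
    {"cafe", "coffeehouse"},
]


def string_similarity(a, b):
    """Character-based similarity in [0, 1] using difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def canonical(token):
    """Map a token to a stable representative of its synonym group."""
    for group in SYNONYM_GROUPS:
        if token in group:
            return min(group)
    return token


def semantic_similarity(a, b):
    """Jaccard overlap of tokens after synonym normalization."""
    tokens_a = {canonical(t) for t in a.lower().split()}
    tokens_b = {canonical(t) for t in b.lower().split()}
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def record_similarity(r1, r2, fields):
    """Average over fields of the better of the two scores (an assumption)."""
    scores = [
        max(string_similarity(r1[f], r2[f]), semantic_similarity(r1[f], r2[f]))
        for f in fields
    ]
    return sum(scores) / len(scores)


def is_duplicate(r1, r2, fields, threshold=0.8):
    """Flag a record pair as duplicate when its score reaches the threshold."""
    return record_similarity(r1, r2, fields) >= threshold


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall from pair-level counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    a = {"name": "joe's bbq", "addr": "12 main st"}
    b = {"name": "joe's barbecue", "addr": "12 main street"}
    print(is_duplicate(a, b, ["name", "addr"]))
```

In this sketch the pair `a` and `b` differs only by abbreviations, which lowers the string score, while the synonym-normalized token sets are identical. That is the kind of case where a semantic measure can help a purely string-based comparison.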

Similar Papers

A String Similarity Evaluation for Healthcare Ontologies Alignment to HL7 FHIR Resources
Athanasios Kiourtis ... Dimosthenis Kyriazis
01 Jan 2019

Duplicate bibliographic record detection with an OCR-converted source of information
Shoichi Taniguchi
Journal of Information Science | VOL. 39
15 Oct 2012

A Comparative Study in Classification Techniques for Unsupervised Record Linkage Model
Ektefa
Journal of Computer Science | VOL. 7
01 Mar 2011

SISR: System for integrating semantic relatedness and similarity measures
Mohamed Ben Aouicha ... Mohamed Ali Hadj Taieb
Soft Computing | VOL. 22
21 Nov 2016
