A proficient cost reduction framework for de-duplication of records in data integration.

Asif Sohail, Muhammad Murtaza Yousaf

doi:10.1186/s12911-016-0280-9

Open Access

Abstract

Background: Record de-duplication is the process of identifying records that refer to the same entity. It plays a pivotal role in data mining applications that involve the integration of multiple data sources and data cleansing. It is a challenging task because of its computational complexity and the variations in data representation across different data sources. Blocking and windowing are the commonly used methods for reducing the number of record comparisons during record de-duplication. Both require tuning of a set of parameters, such as the choice of a particular variant of blocking or windowing and the selection of an appropriate window size for different datasets.

Methods: In this paper, we propose a framework that employs blocking and windowing techniques in succession, so that these parameters do not have to be determined. We also evaluate the impact of different configurations on dirty and massively dirty datasets. To evaluate the proposed framework, experiments are performed using Febrl (Freely Extensible Biomedical Record Linkage).

Results: The proposed framework is comprehensively evaluated using a variety of quality and complexity measures, such as reduction ratio, precision, and recall. The proposed framework is observed to significantly reduce the number of record comparisons.

Conclusions: The selection of the linkage key is a critical performance factor for record linkage.

Electronic supplementary material: The online version of this article (doi:10.1186/s12911-016-0280-9) contains supplementary material, which is available to authorized users.

Highlights

Record de-duplication is a process of identifying the records referring to the same entity
The best value for each of Single Key Blocking (SKB), Composite Key Blocking (CKB) and Multipass Blocking (MPB) is written in bold face and the worst value is written in italic
CKB made the fewest record comparisons and still identified an excellent number of matches
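
The three blocking variants named in the highlights can be illustrated with a small Python sketch. This is not the authors' implementation or Febrl code: it assumes that single key blocking groups records on one attribute, composite key blocking groups them on a combination of attributes, and multipass blocking takes the union of candidate pairs from several single-key passes. The records, the field names `surname` and `postcode`, and the helper `candidate_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Toy records; field names and values are made up for illustration.
RECORDS = [
    {"id": 1, "surname": "smith", "postcode": "2000"},
    {"id": 2, "surname": "smith", "postcode": "2000"},
    {"id": 3, "surname": "smith", "postcode": "3000"},
    {"id": 4, "surname": "smyth", "postcode": "2000"},
    {"id": 5, "surname": "jones", "postcode": "4000"},
]


def candidate_pairs(records, key_fields):
    """Group records by the blocking key and pair up records inside each block."""
    blocks = defaultdict(list)
    for rec in records:
        key = tuple(rec[field] for field in key_fields)
        blocks[key].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        # Only records that share a block are ever compared.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Single key blocking (assumed definition): one attribute as the key.
skb = candidate_pairs(RECORDS, ["surname"])
# Composite key blocking (assumed definition): several attributes combined.
ckb = candidate_pairs(RECORDS, ["surname", "postcode"])
# Multipass blocking (assumed definition): union of several single-key passes.
mpb = candidate_pairs(RECORDS, ["surname"]) | candidate_pairs(RECORDS, ["postcode"])

print(len(skb), len(ckb), len(mpb))
```

On this toy data the composite key produces the fewest candidate pairs and the multipass union the most, which matches the direction of the highlight about CKB, although the toy numbers say nothing about the paper's actual results.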

Summary

Introduction

Record de-duplication is the process of identifying records that refer to the same entity. It plays a pivotal role in data mining applications that involve the integration of multiple data sources and data cleansing. It is a challenging task because of its computational complexity and the variations in data representation across different data sources. Blocking and windowing are the commonly used methods for reducing the number of record comparisons during record de-duplication. Both require tuning of a set of parameters, such as the choice of a particular variant of blocking or windowing and the selection of an appropriate window size for different datasets.

Sohail and Yousaf, BMC Medical Informatics and Decision Making (2016) 16:42
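
To make the idea of blocking and windowing concrete, the following Python sketch applies the two steps one after the other and reports the reduction ratio. It is a simplified illustration under stated assumptions, not the framework proposed in the paper: records are first grouped by a blocking key, each block is then sorted and scanned with a fixed-size sliding window (in the style of the sorted neighbourhood method), and the window size is passed in by hand. The function names, the toy records, and the fields `surname` and `given` are hypothetical. The reduction ratio is computed with the usual definition, one minus the number of candidate pairs divided by the number of all possible pairs.

```python
from collections import defaultdict

# Toy records; field names and values are made up for illustration.
RECORDS = [
    {"id": 1, "surname": "smith", "given": "anna"},
    {"id": 2, "surname": "smith", "given": "ana"},
    {"id": 3, "surname": "smith", "given": "john"},
    {"id": 4, "surname": "smith", "given": "jon"},
    {"id": 5, "surname": "jones", "given": "mary"},
    {"id": 6, "surname": "jones", "given": "marie"},
]


def block_then_window(records, block_key, sort_key, window=2):
    """Block first, then slide a window over each sorted block (assumed design)."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    pairs = set()
    for members in blocks.values():
        ordered = sorted(members, key=sort_key)
        for i, rec in enumerate(ordered):
            # Compare each record with the next (window - 1) records only.
            for other in ordered[i + 1 : i + window]:
                pairs.add(tuple(sorted((rec["id"], other["id"]))))
    return pairs


def reduction_ratio(num_candidates, num_records):
    """1 - candidates / all possible pairs, where all pairs = n * (n - 1) / 2."""
    total_pairs = num_records * (num_records - 1) // 2
    return 1 - num_candidates / total_pairs


pairs = block_then_window(
    RECORDS,
    block_key=lambda r: r["surname"],
    sort_key=lambda r: r["given"],
    window=2,
)
print(sorted(pairs))
print(round(reduction_ratio(len(pairs), len(RECORDS)), 3))
```

In this toy run, blocking on `surname` alone would leave 7 candidate pairs out of 15, and the window inside each block reduces that to 4. The fixed `window` argument is a simplification for the sketch and does not reflect how the paper avoids parameter tuning.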

Journal: BMC Medical Informatics and Decision Making
Publication Date: Apr 12, 2016
Citations: 29
License type: cc-by

Similar Papers

Towards the Ordering of Events from Multiple Textual Evidence Sources
Sarabjot Singh Anand ... Arshad Jhumka
International Journal of Digital Crime and Forensics | VOL. 3
01 Apr 2011

Integration of multiple data sources to prioritize candidate genes using discounted rating system
Yongjin Li ... Jagdish C Patra
BMC Bioinformatics | VOL. 11
01 Jan 2009

An intelligent web search framework for performing efficient retrieval of data
B Bazeer Ahamed ... T Ramkumar
Computers & Electrical Engineering | VOL. 56
03 Oct 2016

The power of integrating multiple data sources in medical imaging: A study of MGMT methylation status
Mariya Miteva ... Maria Nisheva-Pavlova
Procedia Computer Science | VOL. 239
01 Jan 2024
