Abstract

Background: Record de-duplication is a process of identifying the records referring to the same entity. It has a pivotal role in data mining applications, which involves the integration of multiple data sources and data cleansing. It has been a challenging task due to its computational complexity and variations in data representations across different data sources. Blocking and windowing are the commonly used methods for reducing the number of record comparisons during record de-duplication. Both blocking and windowing require tuning of a certain set of parameters, such as the choice of a particular variant of blocking or windowing, the selection of appropriate window size for different datasets, etc.

Methods: In this paper, we have proposed a framework that employs blocking and windowing techniques in succession, such that figuring out the parameters is not required. We have also evaluated the impact of different configurations on dirty and massively dirty datasets. To evaluate the proposed framework, experiments are performed using Febrl (Freely Extensible Biomedical Record Linkage).

Results: The proposed framework is comprehensively evaluated using a variety of quality and complexity parameters such as reduction ratio, precision, recall, etc. It is observed that the proposed framework significantly reduces the number of record comparisons.

Conclusions: The selection of the linkage key is a critical performance factor for record linkage.

Electronic supplementary material: The online version of this article (doi:10.1186/s12911-016-0280-9) contains supplementary material, which is available to authorized users.

Highlights

  • Record de-duplication is a process of identifying the records referring to the same entity

  • The best value for each of Single Key Blocking (SKB), Composite Key Blocking (CKB) and Multipass Blocking (MPB) is written in bold face and the worst value is written in italic

  • CKB made the fewest record comparisons and still identified an excellent number of matches

Summary

Introduction

Record de-duplication is a process of identifying the records referring to the same entity. It has a pivotal role in data mining applications, which involves the integration of multiple data sources and data cleansing. It has been a challenging task due to its computational complexity and variations in data representations across different data sources. Blocking and windowing are the commonly used methods for reducing the number of record comparisons during record de-duplication. Both blocking and windowing require tuning of a certain set of parameters, such as the choice of a particular variant of blocking or windowing, the selection of appropriate window size for different datasets, etc.

Sohail and Yousaf, BMC Medical Informatics and Decision Making (2016) 16:42
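
As a rough illustration of how blocking and windowing cut down record comparisons, the sketch below first groups records by a blocking key and then slides a fixed-size window over each sorted block. It is not the authors' framework and does not use the Febrl API. The record fields, the key functions, the window size, and the helper names (`block_records`, `window_pairs`, `candidate_pairs`, `reduction_ratio`) are all assumptions made for this example. The reduction ratio is computed with the commonly used definition, one minus the fraction of all possible pairs that remain as candidates, which may differ in detail from the measure used in the paper.

```python
def block_records(records, block_key):
    """Group record ids by blocking key (hypothetical helper)."""
    blocks = {}
    for rec_id, rec in records.items():
        blocks.setdefault(block_key(rec), []).append(rec_id)
    return blocks


def window_pairs(ordered_ids, window_size):
    """Pair each id with the next (window_size - 1) ids in sorted order."""
    pairs = set()
    for i, left in enumerate(ordered_ids):
        for right in ordered_ids[i + 1 : i + window_size]:
            pairs.add(tuple(sorted((left, right))))
    return pairs


def candidate_pairs(records, block_key, sort_key, window_size):
    """Blocking followed by windowing inside each block."""
    pairs = set()
    for ids in block_records(records, block_key).values():
        ordered = sorted(ids, key=lambda rid: sort_key(records[rid]))
        pairs |= window_pairs(ordered, window_size)
    return pairs


def reduction_ratio(num_candidates, num_records):
    """1 - candidates / all possible pairs (assumed definition)."""
    total_pairs = num_records * (num_records - 1) // 2
    return 1 - num_candidates / total_pairs


# Toy data, invented for illustration only.
records = {
    1: {"surname": "smith", "given": "john"},
    2: {"surname": "smith", "given": "jon"},
    3: {"surname": "smith", "given": "mary"},
    4: {"surname": "jones", "given": "ann"},
    5: {"surname": "jones", "given": "anne"},
    6: {"surname": "smyth", "given": "john"},
}

pairs = candidate_pairs(
    records,
    block_key=lambda r: r["surname"][0],  # assumed blocking key: first letter of surname
    sort_key=lambda r: r["given"],        # assumed sorting key: given name
    window_size=2,                        # assumed window size
)
rr = reduction_ratio(len(pairs), len(records))
print(sorted(pairs), round(rr, 3))
```

In this toy run, only neighbouring records within each block are compared, so the number of candidate pairs falls well below the 15 pairs an exhaustive comparison of six records would need. The choice of keys strongly affects which true matches survive, which is consistent with the paper's point that linkage key selection is critical.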
