Efficient Record Linkage Algorithms Using Complete Linkage Clustering.

Abdullah-Al Mamun, Robert Aseltine, Sanguthevar Rajasekaran

doi:10.1371/journal.pone.0154446


Open Access

https://doi.org/10.1371/journal.pone.0154446


Journal: PLOS ONE
Publication Date: Apr 28, 2016
Citations: 23
License type: CC BY 4.0

Affiliation: University of Connecticut

Abstract

Datasets from different agencies often contain records of the same individuals. Linking these datasets to identify all the records belonging to the same individuals is a crucial and challenging problem, especially given the large volumes of data. Many of the available algorithms for record linkage are either slow or inaccurate in finding matches and non-matches among the records. In this paper we propose efficient and reliable sequential and parallel algorithms for the record linkage problem, based on complete linkage hierarchical clustering. In addition to hierarchical clustering, we also use two other techniques: elimination of duplicate records and blocking. Our algorithms use sorting as a sub-routine to identify identical copies of records. We have tested our algorithms on datasets with millions of synthetic records. Experimental results show that our algorithms achieve nearly 100% accuracy. Parallel implementations achieve almost linear speedups. Time complexities of these algorithms do not exceed those of previous best-known algorithms. Our proposed algorithms outperform previous best-known algorithms in terms of accuracy while consuming reasonable run times.
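The two preprocessing steps named in the abstract, sorting to find identical copies and blocking, can be illustrated with a minimal sketch. This is not the authors' implementation: the record layout (first name, last name, birth date), the function names, and the blocking key (the first two letters of the last name) are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import groupby


def collapse_exact_duplicates(records):
    """Sort records so identical copies become adjacent, then keep one
    representative per run along with the original indices it stands for."""
    indexed = sorted(enumerate(records), key=lambda pair: pair[1])
    unique = []
    for record, run in groupby(indexed, key=lambda pair: pair[1]):
        unique.append((record, [idx for idx, _ in run]))
    return unique


def build_blocks(unique_records, prefix_len=2):
    """Group representatives by a hypothetical blocking key: the first
    prefix_len letters of the last name (field 1 of the assumed layout)."""
    blocks = defaultdict(list)
    for record, indices in unique_records:
        key = record[1][:prefix_len].lower()
        blocks[key].append((record, indices))
    return dict(blocks)


# Assumed layout: (first_name, last_name, birth_date)
records = [
    ("john", "smith", "1980-01-02"),
    ("jon", "smith", "1980-01-02"),
    ("john", "smith", "1980-01-02"),  # exact copy of record 0
    ("mary", "jones", "1975-07-30"),
]

unique = collapse_exact_duplicates(records)
blocks = build_blocks(unique)
```

After these steps only the three distinct records remain, and the more expensive approximate comparison only needs to run within each block (here, the two "smith" variants) rather than across all pairs.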

Highlights

Health agencies keep track of patients' health information and at the same time records of a patient reside in multiple data sources
We have previously proposed single linkage hierarchical clustering based solutions [20] for this record linkage problem
The exact matching phase can substantially shrink even much cleaner real data sets by removing duplicate records

Summary

Introduction

Health agencies keep track of patients' health information, and at the same time records of a patient reside in multiple data sources. Our proposed algorithms are based on hierarchical clustering [24]. This requires linkage criteria that define how distances are measured between any two clusters. In single linkage, the distance between two clusters A and B is computed as the minimum distance between a point (i.e., a record) in A and a point in B. In complete linkage, the distance between two clusters A and B is computed as the maximum distance between a point in A and a point in B. We have used complete linkage hierarchical clustering for our algorithms. These algorithms generally use edit distance, reversal edit distance and truncation edit distance calculation methods, but our algorithms can support any distance measure.
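The difference between the minimum-based and maximum-based cluster distances can be seen in a small sketch. It assumes plain Levenshtein edit distance (not the reversal or truncation variants), a hypothetical merge threshold, and a naive agglomerative loop that is far slower than the algorithms the paper proposes. All function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def single_linkage(A, B, dist=edit_distance):
    """Cluster distance = minimum pairwise distance."""
    return min(dist(a, b) for a in A for b in B)


def complete_linkage(A, B, dist=edit_distance):
    """Cluster distance = maximum pairwise distance."""
    return max(dist(a, b) for a in A for b in B)


def cluster(records, threshold, linkage=complete_linkage):
    """Naive agglomerative clustering: repeatedly merge the closest pair
    of clusters until no pair is within the (assumed) threshold."""
    clusters = [[r] for r in records]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


names = ["smith", "smyth", "smythe", "smythes"]
by_single = cluster(names, threshold=1, linkage=single_linkage)
by_complete = cluster(names, threshold=1, linkage=complete_linkage)
```

With a threshold of 1, the minimum-based criterion chains all four names into one cluster because each is one edit away from its neighbour, even though "smith" and "smythes" are three edits apart. The maximum-based criterion stops at two clusters, since it only merges when every pair across the two clusters is within the threshold.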
