Scalable and approximate privacy-preserving record linkage

Dinusha Vatsalan

doi:10.25911/5d739004a7846

Abstract

Record linkage, the task of linking multiple databases with the aim of identifying records that refer to the same entity, is increasingly common in many application areas. Generally, unique entity identifiers are not available in all the databases to be linked. Therefore, record linkage requires the use of personal identifying attributes, such as names and addresses, to identify matching records that need to be reconciled to the same entity. Often, it is not permissible to exchange personal identifying data across different organizations due to privacy and confidentiality concerns or regulations. This has led to the novel research area of privacy-preserving record linkage (PPRL). PPRL addresses the problem of how to link different databases to identify records that correspond to the same real-world entities, without revealing the identities of these entities or any private or confidential information to any party involved in the process, or to any external party, such as a researcher. The three key challenges that a PPRL solution in a real-world context needs to address are (1) scalability to large databases by efficiently conducting linkage; (2) achieving high quality of linkage through the use of approximate (string) matching and effective classification of the compared record pairs into matches (i.e. pairs of records that refer to the same entity) and non-matches (i.e. pairs of records that refer to different entities); and (3) provision of sufficient privacy guarantees such that the interested parties only learn the actual values of certain attributes of the records that were classified as matches, and the process is secure with regard to any internal or external adversary. In this thesis, we present extensive research in PPRL, where we have addressed several gaps and problems identified in existing PPRL approaches. First, we begin the thesis with a review of the literature and propose a taxonomy of PPRL to characterize existing techniques. This allows us to identify gaps and research directions. In the remainder of the thesis, we address several of the identified shortcomings. One main shortcoming we address is the lack of a framework for empirical and comparative evaluation of different PPRL solutions, which has not been studied in the literature so far. Second, we propose several novel algorithms for scalable and approximate PPRL by addressing the three main challenges of PPRL. We propose efficient private blocking techniques, for both three-party and two-party scenarios, based on sorted neighborhood clustering to address the scalability challenge. Next, we propose two efficient two-party techniques for private matching and classification to address the linkage quality challenge in terms of approximate matching and effective classification. Privacy is addressed in these approaches using efficient data perturbation techniques including k-anonymous mapping, reference values, and Bloom filters. Finally, the thesis reports on an extensive comparative evaluation of our proposed solutions with several other state-of-the-art techniques on real-world datasets, which shows that our solutions outperform others with respect to all three key challenges.
