Abstract

Linking records that refer to the same entity across different databases has applications in several areas, including healthcare, business, national security, and government services. In the absence of unique identifiers, quasi-identifiers (e.g. name, age, address) must be used to identify records of the same entity in different databases. These quasi-identifiers (QIDs) contain personally identifiable information (PII), so record linkage must be conducted in a way that preserves privacy. Cryptographic Long-term Key (CLK)-based encoding is a popular privacy-preserving record linkage (PPRL) technique in which different QIDs are encoded independently into a representation that preserves the similarity between records but obscures PII. To achieve accurate results, the parameters of a CLK encoding must be tuned to suit the data. To this end, we study a Bayesian optimization method for effectively tuning the hyper-parameters of CLK-based PPRL. Ground-truth labels (match or non-match) would be useful for evaluating linkage quality during the optimization, but they are often difficult to access. We address this by proposing an unsupervised method that uses heuristics to estimate linkage quality. Finally, we investigate the information leakage risk that arises from the iterative nature of optimization methods and discuss recommendations to mitigate it. Experimental results show that our method requires fewer iterations to achieve good linkage results than two baseline optimization methods. It not only improves linkage quality and the computational efficiency of hyper-parameter optimization, but also reduces the privacy risk.
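To make the idea of a similarity-preserving encoding concrete, the following is a minimal sketch of a Bloom-filter-style CLK encoding, not the implementation used in this work. The function names, the choice of character bigrams, the double-hashing scheme, and the parameters `length` and `k` are all assumptions introduced for illustration; they stand in for the kind of encoding hyper-parameters that would be tuned.

```python
import hashlib
import hmac


def bigrams(value: str) -> set:
    """Split a normalised string into overlapping character 2-grams."""
    s = value.lower().strip()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def clk_encode(qids, key: bytes, length: int = 1024, k: int = 10) -> set:
    """Encode the QID values of one record into one Bloom filter.

    The filter is returned as the set of bit positions that are set.
    `length` (filter size) and `k` (hash functions per token) are
    hypothetical hyper-parameters chosen for this sketch.
    """
    bits = set()
    for field_index, value in enumerate(qids):
        for gram in bigrams(value):
            # Prefix the field index so equal bigrams in different
            # QIDs map to different positions (an assumption here).
            token = f"{field_index}:{gram}".encode()
            h1 = int.from_bytes(hmac.new(key, token, hashlib.sha1).digest(), "big")
            h2 = int.from_bytes(hmac.new(key, token, hashlib.sha256).digest(), "big")
            # Double hashing: derive k positions from two keyed hashes.
            for i in range(k):
                bits.add((h1 + i * h2) % length)
    return bits


def dice(a: set, b: set) -> float:
    """Dice coefficient between two filters given as sets of set bits."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


key = b"shared-secret"  # illustrative key agreed between the data owners
rec_a = clk_encode(("john smith", "1980"), key)
rec_b = clk_encode(("jon smith", "1980"), key)    # typo variant of rec_a
rec_c = clk_encode(("alice wong", "1975"), key)   # unrelated record
```

In this sketch, `dice(rec_a, rec_b)` should come out noticeably higher than `dice(rec_a, rec_c)`, because the two near-identical records share most of their bigrams. How sharply matches and non-matches separate depends on `length` and `k`, which is why such parameters need tuning to the data.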
