Abstract

With the development of information technology, sharing data from multiple sources without disclosing private information has become a popular research topic. Privacy-preserving record linkage (PPRL) links the records that truly match without disclosing personal information. Existing PPRL techniques have mostly been studied for alphabetic languages, which differ substantially from the Chinese language environment. In this paper, Chinese characters (the identification fields in record pairs) are encoded into strings of letters and numbers using the SoundShape code, which reflects both their shapes and their pronunciations. The SoundShape codes are then encrypted with a Bloom filter, and the similarity of the encrypted fields is calculated with the Dice similarity. The method takes into account the false positive rate of the Bloom filter and different proportions of sound code and shape code. Finally, we applied the method to synthetic datasets and compared precision, recall, F1-score and computational time for different values of the false positive rate and the proportion. The results showed that our method for PPRL in the Chinese language environment improved the quality of the classification results and outperformed other methods at a relatively low additional computational cost.
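The encode, encrypt and compare steps described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the bigram tokenization, the keyed double-hashing scheme, the standard Bloom filter sizing formulas, and the linear weighting of sound and shape similarity are all assumptions, and the input strings are hypothetical placeholders rather than real SoundShape codes.

```python
import hashlib
import hmac
import math


def bigrams(code: str) -> set[str]:
    """Split an encoded field into overlapping 2-grams (assumed tokenization)."""
    return {code[i:i + 2] for i in range(len(code) - 1)}


def bloom_parameters(n_items: int, fpr: float) -> tuple[int, int]:
    """Standard Bloom filter sizing: bit length m and hash count k
    for n_items inserted elements and a target false positive rate."""
    m = math.ceil(-n_items * math.log(fpr) / (math.log(2) ** 2))
    k = max(1, round(m / n_items * math.log(2)))
    return m, k


def bloom_encode(code: str, m: int, k: int, key: bytes) -> set[int]:
    """Return the positions of the bits set to 1 for the bigrams of `code`.
    Double hashing with two keyed HMACs is an assumption of this sketch."""
    bits: set[int] = set()
    for gram in bigrams(code):
        data = gram.encode("utf-8")
        h1 = int.from_bytes(hmac.new(key, data, hashlib.sha1).digest(), "big")
        h2 = int.from_bytes(hmac.new(key, data, hashlib.md5).digest(), "big")
        for i in range(k):
            bits.add((h1 + i * h2) % m)
    return bits


def dice(a: set[int], b: set[int]) -> float:
    """Dice similarity 2|A∩B| / (|A| + |B|) over the sets of 1-bit positions.
    Two empty filters are scored 0.0 (a choice made for this sketch)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def weighted_similarity(sound_a: str, shape_a: str, sound_b: str, shape_b: str,
                        w: float, m: int, k: int, key: bytes) -> float:
    """Combine sound and shape similarity with proportion w (hypothetical
    linear weighting; the paper may combine the two parts differently)."""
    s_sound = dice(bloom_encode(sound_a, m, k, key), bloom_encode(sound_b, m, k, key))
    s_shape = dice(bloom_encode(shape_a, m, k, key), bloom_encode(shape_b, m, k, key))
    return w * s_sound + (1 - w) * s_shape


if __name__ == "__main__":
    m, k = bloom_parameters(n_items=10, fpr=0.01)
    key = b"shared-secret"  # hypothetical key agreed on by the data owners
    # Placeholder strings standing in for sound and shape codes.
    print(weighted_similarity("ZH4NG1", "A1B2C3", "ZH4NG2", "A1B2C4", 0.5, m, k, key))
```

A lower target false positive rate yields a longer filter and more hash functions, which is one way the trade-off between linkage quality and computation time mentioned in the abstract can arise.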

Highlights

  • In the era of Big Data, it has become increasingly important to obtain more information through multisource data fusion for data analysis, and many organizations have begun to collect and process data from multiple sources to capture valuable information

  • Probabilistic record linkage as proposed by Winkler [26] (PRL-W) is an extension of the Fellegi and Sunter [27] approach (PRL-FS)

  • In this paper, an improved similarity calculation method based on the SoundShape code is proposed to adapt privacy-preserving record linkage (PPRL) to the Chinese language environment


Summary

Introduction

In the era of Big Data, it has become increasingly important to obtain more information through multisource data fusion for data analysis, and many organizations have begun to collect and process data from multiple sources to capture valuable information. Records can be linked directly if unique identifiers (UIDs) of individuals are available. When UIDs are missing from the databases, records can still be integrated and linked through probabilistic record linkage using personal identification fields (e.g., name and address) [1]. Because these fields are sensitive, privacy-preserving record linkage (PPRL) [2] is used: it ensures that only the information of the final matched records is shared between data sources, and that the information of unmatched records is not revealed.

