Evaluating the disclosure risk of anonymized documents via a machine learning-based re-identification attack

Benet Manzanares-Salor, David Sánchez, Pierre Lison

doi:10.1007/s10618-024-01066-3

Abstract

The availability of textual data depicting human-centered features and behaviors is crucial for many data mining and machine learning tasks. However, data containing personal information should be anonymized prior to making them available for secondary use. A variety of text anonymization methods have been proposed in recent years, and they are typically evaluated by comparing their outputs with human-based anonymizations. The residual disclosure risk is estimated with the recall metric, which quantifies the proportion of manually annotated re-identifying terms successfully detected by the anonymization algorithm. Nevertheless, recall is not a risk metric, which leads to several drawbacks. First, it requires a unique ground truth, and this does not hold for text anonymization, where several masking choices could be equally valid to prevent re-identification. Second, it relies on human judgements, which are inherently subjective and prone to errors. Finally, the recall metric weights terms uniformly, thereby ignoring the fact that the influence of some missed terms on the disclosure risk may be much larger than that of others. To overcome these drawbacks, in this paper we propose a novel method to evaluate the disclosure risk of anonymized texts by means of an automated re-identification attack. We formalize the attack as a multi-class classification task and leverage state-of-the-art neural language models to aggregate the data sources that attackers may use to build the classifier. We illustrate the effectiveness of our method by assessing the disclosure risk of several methods for text anonymization under different attack configurations. Empirical results show substantial privacy risks for most existing anonymization methods.
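The contrast the abstract draws between recall and attack-based risk can be illustrated with a toy sketch. It is not the authors' method: the paper builds the classifier with neural language models, whereas this example swaps in a simple bag-of-words cosine-similarity classifier so that it runs with the standard library alone. All names (`recall`, `reidentify`, `reidentification_risk`) and the two-person dataset are hypothetical and chosen only for illustration.

```python
from collections import Counter
import math


def recall(annotated_terms, masked_terms):
    """Proportion of manually annotated re-identifying terms that were masked."""
    annotated = set(annotated_terms)
    if not annotated:
        return 1.0
    return len(annotated & set(masked_terms)) / len(annotated)


def bag_of_words(text):
    """Lower-cased whitespace tokens with their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reidentify(anonymized_text, background):
    """Multi-class prediction: each known individual is one class.

    Stand-in classifier (assumption): pick the individual whose background
    text is most similar to the anonymized document.
    """
    doc = bag_of_words(anonymized_text)
    return max(background, key=lambda p: cosine(doc, bag_of_words(background[p])))


def reidentification_risk(anonymized_docs, background):
    """Fraction of anonymized documents the attack links to the right person."""
    hits = sum(
        reidentify(text, background) == person
        for person, text in anonymized_docs.items()
    )
    return hits / len(anonymized_docs)


# Hypothetical background knowledge an attacker might hold.
background = {
    "alice": "alice is a cardiologist in oslo who runs marathons",
    "bob": "bob is a tax lawyer in madrid who collects stamps",
}

# Hypothetical anonymized documents, keyed by the true individual.
anonymized = {
    "alice": "*** is a cardiologist in *** who runs marathons",
    "bob": "*** is a *** in *** who *** ***",
}

# Alice's document masks 2 of 3 annotated terms, yet the attack still links it
# to her through the unmasked terms.
alice_recall = recall(["alice", "oslo", "cardiologist"], ["alice", "oslo"])
risk = reidentification_risk(anonymized, background)
```

In this toy setting the attack re-identifies the lightly masked document but not the heavily masked one, so `risk` is 0.5. Recall alone would not reveal which missed terms made the difference.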

Journal: Data Mining and Knowledge Discovery
Publication Date: Sep 3, 2024
License type: CC BY 4.0
