Abstract

Data sharing is a central aspect of judicial systems. Openly accessible documents can make the judiciary more transparent. On the other hand, published legal documents can contain much sensitive information about the persons or companies involved. For this reason, the anonymization of these documents is obligatory to prevent privacy breaches. The General Data Protection Regulation (GDPR) and other modern privacy-protecting regulations have strict definitions of private data, which include both direct and indirect identifiers. In legal documents, there is a wide range of attributes regarding the involved parties. Moreover, legal documents can contain additional information about the relations between the involved parties and about rare events. Hence, the personal data can be represented by a sparse matrix of these attributes. The application of Named Entity Recognition methods is essential for a fair anonymization process but is not enough. Machine learning-based methods should be used together with anonymization models, such as differential privacy, to reduce re-identification risk. At the same time, the information content (utility) of the text should be preserved. This paper aims to summarize and highlight the open and symmetrical problems from the fields of structured and unstructured text anonymization. The possible methods for anonymizing legal documents are discussed and illustrated by case studies from Hungarian legal practice.

Highlights

  • Published: 13 August 2021. Digitalization of judicial systems is an important goal of the European Union [1]. Sharing and making court decisions and different legal documents accessible online is a crucial part of this intention

  • Sweeney made a famous linking attack on the set of public health records collected by The National Association of Health Data Organizations (NAHDO) in many states where they had legislative mandates to collect hospital-level data (Figure 3)

  • In Hungary, the Act CXII of 2011 (InfoAct) states that the data subject’s rights after their death could be exercised either by a person appointed by the data subject during their life or a close relative


Summary

Introduction

Digitalization of judicial systems is an important goal of the European Union [1]. Sharing and making court decisions and different legal documents accessible online is a crucial part of this intention. In 2019, a group of researchers carried out a linking attack against anonymized legal cases in Switzerland. They published a study showing that, by using artificial intelligence methods with big data collected from other publicly available databases, they could re-identify 84% of the people anonymized in this database in less than an hour [19]. The current anonymization practice in many European Union countries consists of masking the names and other direct identifiers of the involved persons. This process does not fulfill the requirements of the General Data Protection Regulation. These examples show that mathematical statistical analysis is important in filtering out those unique events that may serve as a primary identifier (e.g., the surgeon amputates the wrong leg). Applications and services that link legal documents together with other databases need special care with regard to the GDPR recommendations.
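The role of unique events as identifiers can be illustrated with a minimal sketch. It is not the paper's method: it only counts how many records share the same combination of indirect identifiers and flags the combinations that occur fewer than k times, in the spirit of a k-anonymity check. The records, the attribute names (`profession`, `county`, `event`) and the helper functions are hypothetical and made up for illustration.

```python
from collections import Counter

# Hypothetical, made-up records: each dict holds indirect identifiers
# that could remain in a legal document after the names are masked.
records = [
    {"profession": "surgeon", "county": "Pest", "event": "wrong-leg amputation"},
    {"profession": "driver", "county": "Pest", "event": "traffic accident"},
    {"profession": "driver", "county": "Pest", "event": "traffic accident"},
    {"profession": "teacher", "county": "Heves", "event": "contract dispute"},
]

# Assumed set of quasi-identifiers; a real analysis would have to choose these.
QUASI_IDENTIFIERS = ("profession", "county", "event")


def equivalence_class_sizes(rows, keys):
    """Count how many rows share each combination of quasi-identifier values."""
    return Counter(tuple(row[key] for key in keys) for row in rows)


def risky_records(rows, keys, k=2):
    """Return the rows whose quasi-identifier combination occurs fewer than k times."""
    sizes = equivalence_class_sizes(rows, keys)
    return [row for row in rows if sizes[tuple(row[key] for key in keys)] < k]


flagged = risky_records(records, QUASI_IDENTIFIERS, k=2)
for row in flagged:
    print(row)
```

In this toy data the surgeon and the teacher records are each unique on the chosen attributes, so an attacker who can match those attributes in another public database could link them to a person even though no name appears. The two driver records share one combination and are not flagged at k=2.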

Privacy and Anonymization
Privacy Models
Types of Privacy Attacks
Structure and Privacy Risks in Hungarian Legal Documents
Criticism of Current Regulation
Current Practice and Potential Risks
Datasets and Search Framework
Illustrative Examples
Quantifying Risk
The Threshold
Automatized Workflows for Pseudonymization
Findings
Conclusions
