Abstract
Data sharing is a central aspect of judicial systems. Openly accessible documents can make the judiciary system more transparent. On the other hand, published legal documents can contain much sensitive information about the involved persons or companies. For this reason, the anonymization of these documents is obligatory to prevent privacy breaches. The General Data Protection Regulation (GDPR) and other modern privacy-protecting regulations have strict definitions of private data, covering both direct and indirect identifiers. Legal documents contain a wide range of attributes regarding the involved parties. Moreover, they can contain additional information about the relations between the involved parties and about rare events. Hence, the personal data can be represented by a sparse matrix of these attributes. The application of Named Entity Recognition methods is essential for a fair anonymization process, but it is not sufficient. Machine learning-based methods should be used together with anonymization models, such as differential privacy, to reduce re-identification risk. At the same time, the information content (utility) of the text should be preserved. This paper aims to summarize and highlight the open and symmetrical problems from the fields of structured and unstructured text anonymization. The possible methods for anonymizing legal documents are discussed and illustrated by case studies from Hungarian legal practice.
Highlights
Published: 13 August 2021
Digitalization of judicial systems is an important goal of the European Union [1]. Sharing court decisions and other legal documents and making them accessible online is a crucial part of this intention.
Sweeney carried out a well-known linking attack on the public health records collected by the National Association of Health Data Organizations (NAHDO) in the many states that had legislative mandates to collect hospital-level data (Figure 3).
In Hungary, Act CXII of 2011 (InfoAct) states that the data subject's rights after their death can be exercised either by a person appointed by the data subject during their life or by a close relative.
Summary
Digitalization of judicial systems is an important goal of the European Union [1]. Sharing court decisions and other legal documents and making them accessible online is a crucial part of this intention. In 2019, a group of researchers carried out a linking attack against anonymized legal cases in Switzerland. They published a study showing that, by applying artificial intelligence methods to big data collected from other publicly available databases, they could re-identify 84% of the people anonymized in this database in less than an hour [19]. The current anonymization practice in many European Union countries consists of masking the names and other direct identifiers of the involved persons. This process does not fulfill the requirements of the General Data Protection Regulation. These examples show that statistical analysis is important for filtering out unique events that may serve as a primary identifier (e.g., the surgeon amputates the wrong leg). Applications and services that link legal documents with other databases need special care to comply with the GDPR recommendations.
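The idea of filtering out unique events can be illustrated with a minimal sketch. It assumes that each document has already been reduced to a small record of indirect identifiers. The field names, the toy records, and the threshold `k` are hypothetical and are not taken from the paper. The function flags every record whose combination of attributes occurs fewer than `k` times, which is the kind of rare combination a linking attack can exploit.

```python
from collections import Counter

def rare_combinations(records, quasi_identifiers, k=2):
    """Return the records whose quasi-identifier combination occurs fewer than k times."""
    # Build one tuple of attribute values per record.
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    # Count how many records share each combination.
    counts = Counter(keys)
    return [r for r, key in zip(records, keys) if counts[key] < k]

# Hypothetical toy records extracted from court decisions.
cases = [
    {"case_id": 1, "profession": "surgeon", "event": "wrong-leg amputation", "county": "A"},
    {"case_id": 2, "profession": "driver", "event": "traffic accident", "county": "A"},
    {"case_id": 3, "profession": "driver", "event": "traffic accident", "county": "A"},
]

flagged = rare_combinations(cases, ["profession", "event", "county"], k=2)
print([r["case_id"] for r in flagged])  # prints [1]
```

Only the first record is flagged, because its combination of profession and event is unique in the toy data set even though no name appears in it. In practice such a check would be only one step next to Named Entity Recognition and a formal anonymization model.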