Automatic Curation of Court Documents: Anonymizing Personal Data

Diego Garat, Dina Wonsever

doi:10.3390/info13010027

Abstract

Providing open access to data of public interest often requires several data curation processes. In some cases, such as biological databases, curation involves quality control to ensure reliable experimental support for biological sequence data. In others, such as medical records or judicial files, publication must not interfere with the right to privacy of the persons involved. Published data may also be processed to generate metadata that improve querying and navigation. In all cases, the curation process is a bottleneck that slows down general access to the data, so automatic or semi-automatic curation processes are of great interest. In this paper, we present a solution aimed at the automatic curation of our National Jurisprudence Database, with a special focus on the anonymization of personal information. The anonymization process aims to hide the names of the participants in a lawsuit without losing the meaning of the narrative of facts. To achieve this goal, we need not only to recognize person names but also to resolve co-references, so that all mentions of the same person receive the same label. Our corpus shows significant differences in the spelling of person names, so it was clear from the beginning that pre-existing tools would not perform well. The challenge was to find a good way of injecting specialized knowledge about the syntax of person names while taking advantage of the existing capabilities of pre-trained tools. We fine-tuned an NER analyzer and built a clustering algorithm to resolve co-references between named entities. We present our first results, which are promising for both tasks: we obtained a micro-averaged F1 of 90.21% in the NER task (up from 39.99% before retraining the same analyzer on our corpus) and an ARI score of 95.95% in clustering for co-reference resolution.

Highlights

The mission of the National Jurisprudence Base (NJB) is to provide free online public access to the decisions of the different courts of the Judicial Branch.
Since its creation in 2008, rulings from different courts have been systematically added to the NJB, including those from the Supreme Court of Justice (SCJ), the Courts of Appeals and the Courts of First Instance.
The SCJ has detailed the data that must be deleted under current regulations (e.g., crimes related to modesty or decency, those involving minor offenders, etc.), but has also advised against publishing data of another group of people not explicitly covered by the legislation, such as primary offenders, whistle-blowers, witnesses, etc.

Summary

Introduction

The mission of the National Jurisprudence Base (NJB) is to provide free online public access to the decisions of the different courts of the Judicial Branch. Judicial rulings contain a large number of citations to laws, decrees, bibliographies and previous rulings, among others, which are not collected or systematized in any way: the reader has to look for them throughout the text. They are completely ignored when documents are incorporated into the NJB, even though they could be useful for searching and data exploitation. Three lines of work were developed: (1) the anonymization of sensitive data, (2) automatic classification within a legal taxonomy and (3) the detection of citations of various kinds within the texts of the judgments: to other judicial decisions, to laws and decrees, or to previous arguments of leading authorities in the field of Law. In this paper, we focus on the de-identification of proper names, a problem for which there are different ways to carry out the anonymization process.
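
One possible way to carry out the replacement step, once person mentions have been recognized and grouped by co-reference, can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `pseudonymize`, the `(start, end, cluster_id)` mention format, the `PERSON_n` label scheme and the sample sentence with fictitious names are all hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed input format: (start, end, cluster_id), where start/end are character
# offsets of a person mention and cluster_id identifies the co-reference
# cluster (i.e. the person) that the mention belongs to.
Mention = Tuple[int, int, int]


def pseudonymize(text: str, mentions: List[Mention]) -> str:
    """Replace every mention with a label shared by its whole cluster."""
    labels: Dict[int, str] = {}
    # Assign labels in order of first appearance in the text.
    for _start, _end, cluster in sorted(mentions):
        if cluster not in labels:
            labels[cluster] = f"PERSON_{len(labels) + 1}"
    # Replace from right to left so that earlier offsets stay valid.
    out = text
    for start, end, cluster in sorted(mentions, reverse=True):
        out = out[:start] + labels[cluster] + out[end:]
    return out


def span(text: str, name: str, search_from: int = 0) -> Tuple[int, int]:
    """Helper for the example: offsets of `name` at or after `search_from`."""
    start = text.index(name, search_from)
    return start, start + len(name)


# Fictitious sentence; in practice the mentions would come from the NER
# analyzer and the clusters from the co-reference step.
text = "Juan Perez sued Maria Gomez. Perez claimed that Gomez owed him money."
first_perez = span(text, "Juan Perez")
first_gomez = span(text, "Maria Gomez")
second_perez = span(text, "Perez", first_gomez[1])
second_gomez = span(text, "Gomez", second_perez[1])

mentions = [
    (*first_perez, 0),
    (*first_gomez, 1),
    (*second_perez, 0),
    (*second_gomez, 1),
]

result = pseudonymize(text, mentions)
print(result)
```

Because both mentions of each person share a cluster, they receive the same label, which keeps the narrative of facts readable after the names are hidden.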

Related Work

De-Identification of Legal Texts

Corpus

NER Training

Co-Reference Resolution

Findings

Conclusions and Further Work

Journal: Information	Publication Date: Jan 10, 2022
Citations: 5	License type: CC BY 4.0
