Abstract

Providing open access to data of public interest often requires several data curation steps. In some cases, such as biological databases, curation involves quality control to ensure reliable experimental support for biological sequence data. In others, such as medical records or judicial files, publication must not interfere with the right to privacy of the persons involved. Published data may also be enriched with metadata that improve querying and navigation. In all cases, the curation process is a bottleneck that slows down general access to the data, so automatic or semi-automatic curation processes are of great interest. In this paper, we present a solution aimed at the automatic curation of our National Jurisprudence Database, with special focus on the anonymization of personal information. The anonymization process aims to hide the names of the participants in a lawsuit without losing the meaning of the narrative of facts. To achieve this goal, we need not only to recognize person names but also to resolve co-references, so that all mentions of the same person receive the same label. Our corpus shows significant differences in the spelling of person names, so it was clear from the beginning that pre-existing tools would not perform well. The challenge was to find a good way of injecting specialized knowledge about the syntax of person names while taking advantage of the existing capabilities of pre-trained tools. We fine-tuned an NER analyzer and built a clustering algorithm to resolve co-references between named entities. We present our first results, which are promising for both tasks: we obtained a micro-averaged F1 of 90.21% in the NER task (up from 39.99% before retraining the same analyzer on our corpus) and an ARI score of 95.95% in clustering for co-reference resolution.
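For readers unfamiliar with the clustering metric, the Adjusted Rand Index (ARI) compares two groupings of the same items, here the gold and predicted co-reference clusters of name mentions, and corrects the score for chance agreement. The following is a minimal sketch of the standard ARI formula in plain Python; the function name and the toy labels are illustrative assumptions, not the evaluation code used for the reported results.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Standard Adjusted Rand Index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same items")

    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # zero or one item: the labelings trivially agree

    # Contingency counts: how many items share each (true, predicted) pair.
    joint = Counter(zip(labels_true, labels_pred))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every item in a single cluster
    return (sum_joint - expected) / (maximum - expected)


# Toy example: cluster ids differ, but the grouping of mentions is identical.
gold = [0, 0, 1, 1]
predicted = ["b", "b", "a", "a"]
print(adjusted_rand_index(gold, predicted))
```

Because the index depends only on which mentions are grouped together, renaming the cluster labels does not change the score.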

Highlights

  • The National Jurisprudence Base’s (NJB) mission is to provide public access to the decisions of the different courts of the Judicial Branch, online and for free

  • Since its creation in 2008, rulings from different courts have been systematically added to the NJB, including those from the Supreme Court of Justice (SCJ), the Courts of Appeals and from the Courts of First Instance

  • The SCJ has detailed the data that must be deleted under current regulations (e.g., in cases of crimes related to modesty or decency, those involving minor offenders, etc.), but has also advised against publishing data of another group of people not explicitly contemplated in the legislation, such as primary offenders, whistle-blowers and witnesses


Summary

Introduction

The National Jurisprudence Base’s (NJB) mission is to provide public access to the decisions of the different courts of the Judicial Branch, online and for free. Judicial rulings contain a large number of citations to laws, decrees, bibliographies and previous rulings, among others, which are not collected or systematized in any way: the reader has to look for them throughout the text. They are completely ignored when documents are incorporated into the NJB, even though they could be useful for searching and data exploitation. Three lines of work were developed: (1) the anonymization of sensitive data, (2) the automatic classification within a legal taxonomy, and (3) the detection of citations of various kinds within the texts of the judgments: to other judicial decisions, to laws and decrees, or to previous arguments of recognized authorities in the field of Law. In this paper, we focus on the de-identification of proper names, a problem for which there are different ways to carry out the anonymization process.
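One such way, replacing every mention of the same person with a shared label, can be illustrated with a minimal sketch. It assumes the person mentions have already been extracted by an NER step, and it groups them with a simple token-subset heuristic. The function names, the `PERSON_n` label format and the heuristic itself are illustrative assumptions, not the clustering algorithm developed in this work.

```python
import re
import unicodedata


def name_tokens(name):
    """Lowercase a name, strip accents and punctuation, return its token set."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return frozenset(re.findall(r"[a-z]+", no_accents.lower()))


def assign_labels(mentions):
    """Map each distinct mention to a label shared by its cluster.

    Hypothetical heuristic: longer names are processed first and become
    cluster representatives; a shorter mention joins the first cluster
    whose representative contains all of its tokens.
    """
    unique = list(dict.fromkeys(mentions))  # de-duplicate, keep input order
    unique.sort(key=lambda m: -len(name_tokens(m)))  # stable sort, longest first
    representatives = []  # token set of each cluster's representative
    labels = {}
    for mention in unique:
        tokens = name_tokens(mention)
        for index, rep in enumerate(representatives):
            if tokens and tokens <= rep:
                labels[mention] = f"PERSON_{index + 1}"
                break
        else:
            representatives.append(tokens)
            labels[mention] = f"PERSON_{len(representatives)}"
    return labels


def anonymize(text, labels):
    """Replace every known mention in the text with its cluster label."""
    if not labels:
        return text
    # Longest mentions first, so a full name wins over a bare surname.
    ordered = sorted(labels, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(m) for m in ordered))
    return pattern.sub(lambda match: labels[match.group(0)], text)


# Toy example with invented names; mentions would normally come from the NER step.
text = "Juan Pérez sued María Gómez López. PEREZ claims that Gómez owes him money."
mentions = ["Juan Pérez", "María Gómez López", "PEREZ", "Gómez"]
labels = assign_labels(mentions)
print(anonymize(text, labels))
```

Keeping one label per cluster preserves who did what in the narrative of facts while hiding the names. A bare surname shared by two parties would be ambiguous under this heuristic, which is one reason a dedicated co-reference step is needed.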

Related Work
De-Identification of Legal Texts
Corpus
NER Training
Co-Reference Resolution
Findings
Conclusions and Further Work
