Legal Entity Extraction using a Pointer Generator Network

Stavroula Skylaki, Ali Oskooei, Omar Bari, Nadja Herger, Zac Kriegman

doi:10.1109/icdmw53433.2021.00086

Abstract

Named Entity Recognition (NER) is the task of identifying and classifying named entities in unstructured text. In the legal domain, named entities of interest may include the case parties, judges, names of courts, case numbers, references to laws, etc. We study the problem of legal named entity extraction from noisy text extracted from PDF files of court cases filed in US courts. The "gold standard" training data for classical NER systems annotate each token of the text with the corresponding entity or non-entity label. We work with only partially complete training data, which differ from gold standard NER data in that the exact location of the entities in the text is unknown and the entities may contain typos and/or OCR mistakes. To overcome the challenges of our noisy training data, e.g., text extraction errors and/or typos and unknown label indices, we frame the NER task as a sequence generation (seq2seq) task and train a pointer generator network to generate the entities in the document rather than label them. We also attempt to create an NER gold standard dataset via sequence matching, use this dataset to train classical NER baselines, and compare them with our seq2seq approach for Named Entity (NE) extraction. We show that the seq2seq approach can effectively extract legal named entities in the absence of gold standard data and outperform the common neural network architectures for NER in long legal documents.
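
To make the sequence-matching step concrete, the following is a minimal sketch of how document-level entity strings might be projected onto tokens as BIO labels when the text contains OCR noise. It is an illustration only, not the authors' procedure: the function names `best_fuzzy_span` and `bio_labels`, the 0.8 similarity threshold, the entity type names, and the use of Python's `difflib.SequenceMatcher` as the matcher are all assumptions.

```python
from difflib import SequenceMatcher


def best_fuzzy_span(tokens, entity, min_ratio=0.8):
    """Return (start, end, ratio) for the token window most similar to
    `entity`, or None if no window reaches `min_ratio` (end is exclusive).
    Hypothetical helper; the threshold is an assumed value."""
    target = entity.lower()
    n = len(entity.split())
    best = None
    # Try window sizes near the entity's token count, since extraction
    # noise can split or merge tokens.
    for size in range(max(1, n - 1), n + 2):
        for start in range(0, len(tokens) - size + 1):
            window = " ".join(tokens[start:start + size]).lower()
            ratio = SequenceMatcher(None, window, target).ratio()
            if ratio >= min_ratio and (best is None or ratio > best[2]):
                best = (start, start + size, ratio)
    return best


def bio_labels(tokens, entities, min_ratio=0.8):
    """Project (entity_text, entity_type) pairs onto tokens as BIO tags.
    Entities that cannot be located are skipped, so their tokens stay 'O';
    later entities overwrite earlier ones if their spans overlap."""
    labels = ["O"] * len(tokens)
    for text, etype in entities:
        span = best_fuzzy_span(tokens, text, min_ratio)
        if span is None:
            continue
        start, end, _ = span
        labels[start] = f"B-{etype}"
        for i in range(start + 1, end):
            labels[i] = f"I-{etype}"
    return labels


# Noisy text ("Jobn", "Distrlct") paired with clean document-level entities.
tokens = "Before Judge Jobn Smith in the Southern Distrlct of New York".split()
entities = [
    ("John Smith", "JUDGE"),
    ("Southern District of New York", "COURT"),
]
labels = bio_labels(tokens, entities)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

A matcher of this kind silently drops entities it cannot locate and can mislabel near-duplicate spans, which is one reason a dataset built this way may fall short of a true gold standard.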
