Preparing Legal Documents for NLP Analysis: Improving the Classification of Text Elements by Using Page Features

Frieda Josi, Christian Wartena, Ulrich Heid

doi:10.5121/csit.2022.120102

Abstract

Legal documents often have a complex layout with many different headings, headers and footers, side notes, etc. For further processing, it is important to extract these individual components correctly from a legally binding document, for example a signed PDF. A common approach is to classify each (text) region of a page using its geometric and textual features. This approach works well when the training and test data have a similar structure and when the documents of the collection to be analyzed have a rather uniform layout. We show that the use of global page properties can improve the accuracy of text element classification: we first classify each page into one of three layout types. We then train a separate classifier for each of the three page types and thereby improve the accuracy on a manually annotated collection of 70 legal documents consisting of 20,938 text elements. When we split by page type, accuracy improves from 0.95 to 0.98 for single-column pages with left marginalia and from 0.95 to 0.96 for double-column pages. We developed our own feature-based method for page layout detection, which we benchmark against a standard implementation of a CNN image classifier.
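
The two-stage idea described in the abstract (determine the page layout type first, then apply a classifier trained only on pages of that type) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the nearest-centroid classifier is a toy stand-in for whatever feature-based classifier the authors use, the two features (normalized left edge and relative font size) are hypothetical, and only the two layout types named in the abstract appear in the toy data.

```python
from collections import defaultdict
from math import dist


class NearestCentroid:
    """Toy stand-in for a feature-based text element classifier (assumption)."""

    def fit(self, X, y):
        rows_by_label = defaultdict(list)
        for x, label in zip(X, y):
            rows_by_label[label].append(x)
        # One centroid per label: the column-wise mean of its feature vectors.
        self.centroids = {
            label: tuple(sum(col) / len(rows) for col in zip(*rows))
            for label, rows in rows_by_label.items()
        }
        return self

    def predict(self, x):
        # Return the label whose centroid is closest to x.
        return min(self.centroids, key=lambda label: dist(x, self.centroids[label]))


def train_per_page_type(elements):
    """Train one classifier per page type.

    elements: iterable of (page_type, feature_vector, label) triples.
    """
    grouped = defaultdict(lambda: ([], []))
    for page_type, x, label in elements:
        grouped[page_type][0].append(x)
        grouped[page_type][1].append(label)
    return {pt: NearestCentroid().fit(X, y) for pt, (X, y) in grouped.items()}


def classify_element(models, page_type, x):
    """Route the element to the classifier trained for its page type."""
    return models[page_type].predict(x)


# Hypothetical features: (normalized left edge, relative font size).
training = [
    ("single_left_marginalia", (0.05, 0.8), "side_note"),
    ("single_left_marginalia", (0.25, 1.0), "body"),
    ("double_column", (0.05, 1.0), "body"),
    ("double_column", (0.55, 1.0), "body"),
    ("double_column", (0.05, 1.6), "heading"),
]
models = train_per_page_type(training)

# The same left position means different things on different page types.
print(classify_element(models, "single_left_marginalia", (0.06, 0.8)))
print(classify_element(models, "double_column", (0.06, 1.0)))
```

In this toy data, a text element near the left page edge is a side note on a page with left marginalia but ordinary body text on a double-column page, which is the kind of ambiguity a single classifier trained on all pages has to resolve without knowing the page type.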

Introduction

Many documents are only available as PDF. This is especially the case for legal documents, where one exact copy including layout and signatures is distributed and archived. Extracting the text from a legal document is often challenging because contracts, for example, often have a complex structure with lists, footnotes, side notes, multiple columns, headers and footers, and so on. Contracts often consist of several parts, such as an address page, signature page, project description and terms of service, each of which may have a completely different layout. In order to extract text from a PDF, we first identify characters, regions of closely neighbouring characters (words) and regions with dense text.
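
The step from characters to words can be sketched as a simple gap-based grouping along one text line. This is a minimal sketch under stated assumptions, not the authors' extraction pipeline: the input format `(x0, x1, char)` and the `max_gap` threshold are hypothetical, and a real PDF extractor would also have to handle multiple lines, varying font sizes and rotated text.

```python
def group_chars_into_words(chars, max_gap=1.5):
    """Group characters on one text line into words by horizontal proximity.

    chars: list of (x0, x1, char) tuples, where x0 and x1 are the left and
    right edges of the character box. max_gap is a hypothetical threshold:
    a larger horizontal gap starts a new word.
    """
    words = []
    current = []
    prev_x1 = None
    for x0, x1, ch in sorted(chars):  # left-to-right reading order
        if prev_x1 is not None and x0 - prev_x1 > max_gap:
            words.append("".join(current))
            current = []
        current.append(ch)
        prev_x1 = x1
    if current:
        words.append("".join(current))
    return words


# Three closely spaced characters, a wide gap, then two more characters.
line = [(0, 5, "N"), (5.5, 10, "L"), (10.5, 15, "P"), (20, 25, "o"), (25.5, 30, "k")]
print(group_chars_into_words(line))
```

The same proximity idea extends to the next level: words whose boxes lie close together, horizontally and vertically, can be merged into regions of dense text.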

Publication Date: Jan 22, 2022
Citations: 1
License type: cc-by