Kleister: Key Information Extraction Datasets Involving Long Documents with Complex Layouts

Tomasz Stanisławek, Przemysław Biecek, Anna Wróblewska, Dawid Lipiński, Agnieszka Kaliska, Paulina Rosalska, Filip Graliński, Bartosz Topolski

doi:10.1007/978-3-030-86549-8_36

Abstract

The Key Information Extraction (KIE) task is increasingly important in natural language processing. However, only a few well-defined problems currently serve as benchmarks for solutions in this area. To bridge this gap, we introduce two new datasets (Kleister NDA and Kleister Charity). They involve a mix of scanned and born-digital long formal English-language documents. In these datasets, an NLP system is expected to find or infer various types of entities by employing both textual and structural layout features. The Kleister Charity dataset consists of 2,788 annual financial reports of charity organizations, with 61,643 unique pages and 21,612 entities to extract. The Kleister NDA dataset has 540 Non-disclosure Agreements, with 3,229 unique pages and 2,160 entities to extract. We provide several state-of-the-art baseline systems from the KIE domain (Flair, BERT, RoBERTa, LayoutLM, LAMBERT), which show that our datasets pose a strong challenge to existing models. The best model achieved F1-scores of 81.77% and 83.57% on the Kleister NDA and Kleister Charity datasets, respectively. We share the datasets to encourage progress on more in-depth and complex information extraction tasks.

