Assessing the Impact of OCR Quality on Downstream NLP Tasks

Daniel Van Strien, Kaspar Beelen, Mariona Ardanuy, Kasra Hosseini, Barbara McGillivray, Giovanni Colavizza

doi:10.5220/0009169004840496

Open Access

Abstract

A growing volume of heritage data is being digitized and made available as text via optical character recognition (OCR). Scholars and libraries are increasingly using OCR-generated text for retrieval and analysis. However, the process of creating text through OCR introduces varying degrees of error to the text. The impact of these errors on natural language processing (NLP) tasks has only been partially studied. We perform a series of extrinsic assessment tasks — sentence segmentation, named entity recognition, dependency parsing, information retrieval, topic modelling and neural language model fine-tuning — using popular, out-of-the-box tools in order to quantify the impact of OCR quality on these tasks. We find a consistent impact resulting from OCR errors on our downstream tasks with some tasks more irredeemably harmed by OCR errors. Based on these results, we offer some preliminary guidelines for working with text produced through OCR.

Publication Date: Jan 1, 2020
Citations: 61
License type: CC BY-NC-ND

Similar Papers

BERTimbau: Pretrained BERT Models for Brazilian Portuguese
Fábio Souza ... Rodrigo Nogueira
01 Jan 2020

BERT models for Brazilian Portuguese: Pretraining, evaluation and tokenization analysis
F.C Souza ... R.A Lotufo
Applied Soft Computing | VOL. 149
07 Oct 2023

BERTimbau: pretrained BERT models for Brazilian Portuguese
15 Oct 2020

GREEK-BERT: The Greeks visiting Sesame Street
John Koutsikakis ... Ilias Chalkidis
02 Sep 2020