Assessing the impact of OCR noise on multilingual event detection over digitised documents

Emanuela Boros, Gaël Lejeune, Antoine Doucet, Nhu Khoa Nguyen

doi:10.1007/s00799-022-00325-2

Abstract

Event detection is a crucial task in natural language processing: it involves identifying instances of specified types of events in text and classifying them into event types. Detecting events in digitised documents could enable historians to gather and combine a large amount of information into an integrated whole, a panoramic interpretation of the past. However, the degradation of digitised documents and the quality of optical character recognition (OCR) tools might hinder the performance of an event detection system. Several studies have detected events in historical documents, but the transcribed documents had to be validated by hand, which required considerable human expertise and labour-intensive manual work. In this study, we therefore explore the robustness of two language-independent event detection models to OCR noise, over two datasets that cover different event types and multiple languages. We analyse their ability to mitigate problems caused by the low quality of digitised documents, and we simulate transcribed data by injecting synthetic noise into clean annotated text. To create the noisy synthetic data, we use four main types of noise that commonly occur after the digitisation process: Character Degradation, Bleed Through, Blur, and Phantom Character. Finally, we conclude that the imbalance of the datasets, the richness of the different annotation styles, and the language characteristics are the most important factors that can influence event detection in digitised documents.
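As a rough illustration of what injecting synthetic noise into clean text can look like, the sketch below corrupts characters directly in a string and measures the resulting character error rate. This is a simplified, text-level stand-in and not the procedure used in the study: the four noise types named in the abstract are document-image degradations, whereas the function names (`inject_char_noise`, `char_error_rate`), the three edit operations, and the `rate` parameter here are assumptions made purely for illustration.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def inject_char_noise(text, rate, seed=0):
    """Corrupt roughly `rate` of the alphabetic characters in `text`.

    Hypothetical helper: each selected character is substituted, deleted,
    or followed by a spurious extra character, loosely mimicking OCR errors.
    """
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < rate:
            op = rng.choice(("substitute", "delete", "insert"))
            if op == "substitute":
                out.append(rng.choice(ALPHABET))
            elif op == "insert":
                out.append(ch)
                out.append(rng.choice(ALPHABET))
            # "delete": append nothing
        else:
            out.append(ch)
    return "".join(out)


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def char_error_rate(reference, hypothesis):
    """Edit distance normalised by the length of the clean reference."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)


if __name__ == "__main__":
    clean = "The earthquake struck the city on Monday morning."
    for rate in (0.0, 0.05, 0.2):
        noisy = inject_char_noise(clean, rate)
        print(rate, round(char_error_rate(clean, noisy), 3), noisy)
```

Sweeping `rate` in this way gives a controlled range of noise levels at which a detection model could be evaluated; the measured character error rate only approximates `rate`, since non-alphabetic characters are left untouched and a substitution may pick the original letter.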


Journal: International Journal on Digital Libraries
Publication Date: Apr 4, 2022
Citations: 8
License type: cc-by


Similar Papers

Robustness Analysis on Graph Neural Networks Model for Event Detection
Hui Wei ... Hanqing Zhu
Applied Sciences | VOL. 12
25 Oct 2022

PNRank: Unsupervised ranking of person name entities from noisy OCR text
Haimonti Dutta ... Aayushee Gupta
Decision Support Systems | VOL. 152
21 Aug 2021

Towards a Novel Weakly Supervised Joint Approach of Named Entity Recognition and Normalization for Noisy Text
Assia Mezhar ... Mohammed Ramdani
SSRN Electronic Journal | VOL. -
01 Jan 2018

Towards a Novel Weakly Supervised Joint Approach of Named Entity Recognition and Normalization for Noisy Text
Assia Mezhar ... Amal El Mzabi
SSRN Electronic Journal | VOL. -
09 May 2018
