Lexically Aware Semi-Supervised Learning for OCR Post-Correction

Shruti Rijhwani, Antonios Anastasopoulos, Graham Neubig, Daisy Rosenblum

doi:10.1162/tacl_a_00427

Abstract

Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of neural post-correction methods that improve the results of general-purpose OCR systems on recognition of less-well-resourced languages. However, these methods rely on manually curated post-correction data, which are relatively scarce compared to the non-annotated raw images that need to be digitized. In this paper, we present a semi-supervised learning method that makes it possible to utilize these raw images to improve performance, specifically through the use of self-training, a technique where a model is iteratively trained on its own outputs. In addition, to enforce consistency in the recognized vocabulary, we introduce a lexically aware decoding method that augments the neural post-correction model with a count-based language model constructed from the recognized texts, implemented using weighted finite-state automata (WFSA) for efficient and effective decoding. Results on four endangered languages demonstrate the utility of the proposed method, with relative error reductions of 15%–29%, where we find the combination of self-training and lexically aware decoding essential for achieving consistent improvements.

Highlights

There is a vast amount of textual data available in printed form (Dong and Smith, 2018).
Metrics: We evaluate our systems in terms of character error rate (CER) and word error rate (WER), both standard metrics for measuring OCR and OCR post-correction performance (Berg-Kirkpatrick et al., 2013; Schulz and Kuhn, 2017).
For all languages, using semi-supervised learning leads to substantial reductions in both CER and WER.
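
The two metrics named above can be illustrated with a minimal sketch. It assumes the common definitions (edit distance divided by reference length, over characters for CER and over whitespace-separated words for WER); the paper's own evaluation may apply different normalization or tokenization, and the function names here are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits per reference character."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits per reference word (whitespace tokens)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    gold = "the cat sat"
    ocr_output = "the cot sat"
    print(f"CER = {cer(gold, ocr_output):.3f}")
    print(f"WER = {wer(gold, ocr_output):.3f}")
```

A single wrong character counts once toward CER but makes its whole word wrong for WER, which is why WER is typically the higher of the two on the same output.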

Summary

Introduction

There is a vast amount of textual data available in printed form (Dong and Smith, 2018). We address the task of digitizing printed materials that contain text in endangered languages, i.e., languages with small populations of first-language speakers. Automatic digitization can aid language documentation, preservation, and accessibility efforts by archiving the texts and making them searchable for language learners, teachers, and speakers, contributing to essential resources for community-based language revitalization. Most endangered languages are under-represented in natural language processing technologies, primarily because there is little to no data available for training and evaluation (Joshi et al., 2020). This challenge can be mitigated by converting printed materials in these languages to a machine-readable format.

Journal: Transactions of the Association for Computational Linguistics
Publication Date: Nov 22, 2021
Citations: 2
License type: CC BY 4.0
