Lexically-Aware Semi-Supervised Learning for OCR Post-Correction

Antonios Anastasopoulos, Shruti Rijhwani, Graham Neubig, Daisy Rosenblum

doi:10.48448/fycy-h885

Abstract

Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents. Optical character recognition (OCR) can be used to produce digitized text, and previous work has demonstrated the utility of neural post-correction methods that improve the results of general-purpose OCR systems on less-well-resourced languages. However, these methods rely on manually curated post-correction data, which are scarce compared to the non-annotated raw images that need to be digitized. In this paper, we present a semi-supervised learning method that makes it possible to utilize these raw images to improve performance, specifically through the use of self-training, a technique where a model is iteratively trained on its own outputs. In addition, to enforce consistency in the recognized vocabulary, we introduce a lexically aware decoding method that augments the neural post-correction model with a count-based language model constructed from the recognized texts, implemented using weighted finite-state automata (WFSA) for efficient and effective decoding. Results on four endangered languages demonstrate the utility of the proposed method, with relative error reductions of 15-29%, where we find the combination of self-training and lexically aware decoding essential for achieving consistent improvements.
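The two ideas named in the abstract, self-training and a count-based lexicon built from the model's own recognized text, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the neural post-correction model is replaced by a toy word-memorizing model, the WFSA decoder is replaced by a simple rescoring of candidate words, and every name here (`train_word_map`, `self_train`, `build_lexicon`, `lexical_rescore`, the `weight` parameter) is hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]        # (first-pass OCR line, corrected line)
Model = Callable[[str], str]  # maps an OCR line to a corrected line


def train_word_map(pairs: List[Pair]) -> Model:
    """Toy stand-in for a neural post-correction model (assumption):
    memorize each OCR word's most frequent correction, copy unseen words."""
    votes: Dict[str, Counter] = {}
    for src, tgt in pairs:
        for s, t in zip(src.split(), tgt.split()):
            votes.setdefault(s, Counter())[t] += 1
    table = {s: c.most_common(1)[0][0] for s, c in votes.items()}
    return lambda line: " ".join(table.get(w, w) for w in line.split())


def self_train(train_fn: Callable[[List[Pair]], Model],
               labeled: List[Pair], unlabeled: List[str],
               rounds: int = 2) -> Model:
    """Self-training: label the raw data with the current model, then
    retrain on gold pairs plus those pseudo-labeled pairs."""
    model = train_fn(labeled)
    for _ in range(rounds):
        pseudo = [(line, model(line)) for line in unlabeled]
        model = train_fn(labeled + pseudo)
    return model


def build_lexicon(model: Model, unlabeled: List[str]) -> Counter:
    """Count-based lexicon over the words the model itself recognized."""
    return Counter(w for line in unlabeled for w in model(line).split())


def lexical_rescore(candidates: Dict[str, float], lexicon: Counter,
                    weight: float = 0.5) -> str:
    """Pick the candidate maximizing model log-prob + weight * lexicon
    log-prob (add-one smoothed). A simplification, not WFSA decoding."""
    total = sum(lexicon.values())
    vocab = len(lexicon) + 1  # one extra slot for unseen words

    def score(word: str) -> float:
        lm = math.log((lexicon[word] + 1) / (total + vocab))
        return candidates[word] + weight * lm

    return max(candidates, key=score)


if __name__ == "__main__":
    labeled = [("tbe cat", "the cat")]
    unlabeled = ["tbe dog", "tbe cat sat"]
    model = self_train(train_word_map, labeled, unlabeled)
    lexicon = build_lexicon(model, unlabeled)
    print(model("tbe dog"))
    # The model slightly prefers "tlie", but the lexicon favors "the".
    print(lexical_rescore({"tlie": math.log(0.55), "the": math.log(0.45)}, lexicon))
```

In this toy setting the lexicon term pulls the decoder toward spellings the model has already produced elsewhere in the recognized text, which is the vocabulary-consistency effect the abstract describes.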

Similar Papers

Lexically Aware Semi-Supervised Learning for OCR Post-Correction
Shruti Rijhwani ... Graham Neubig
Transactions of the Association for Computational Linguistics | VOL. 9
22 Nov 2021

Rational Recurrences
Hao Peng ... Roy Schwartz
01 Jan 2018

Optical Character Detection and Recognition for Image-Based in Natural Scene
Bochao Wang ... Chen Zhang
01 Jan 2018

Enhanced ResNet-151-based fused features for optimized Bi-LSTM-DNN-aided handwritten character and digits recognition
Srinivasa Rao N ... Nelson Kennedy Babu C
Expert Systems with Applications | VOL. 244
08 Dec 2023
