Abstract

The accuracy of a typical state-of-the-art optical character recognition (OCR) system benefits greatly from a language model (LM). However, a conventional LM has a limited vocabulary, so out-of-vocabulary (OOV) words cannot be recognized by the OCR system. In this paper, we present an open-vocabulary OCR system based on a hybrid LM whose vocabulary consists of both words and subwords, so that OOV words can be generated as combinations of subwords. We refine the hybrid LM training scheme by interpolating a standard hybrid LM, a word-based LM, and a subword-based LM. Words are recombined efficiently from subwords by modeling optional space symbols in the decoding network. The overall system handles OOV words in a general, data-driven, and language-independent way. We conduct experiments on an English handwriting OCR task. Evaluations on three test sets show that the OCR system with the proposed method achieves a word error rate of 33.4% on an OOV-only test set without degrading recognition accuracy on the other two test sets, which consist mainly of in-vocabulary words.
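
To make the interpolation step concrete, the following is a minimal sketch of how three LMs might be mixed. It assumes plain linear interpolation over toy unigram distributions. The abstract does not state the interpolation form, the weights, or the LM order, so the function name `interpolate`, the weights, and all probabilities below are hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence


def interpolate(models: Sequence[Dict[str, float]],
                weights: Sequence[float]) -> Dict[str, float]:
    """Linearly interpolate unigram distributions (assumed form, not from the paper)."""
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # The mixed vocabulary is the union of all model vocabularies.
    vocab = set().union(*models)
    # A token missing from a model contributes probability 0 from that model.
    return {
        tok: sum(w * m.get(tok, 0.0) for m, w in zip(models, weights))
        for tok in vocab
    }


# Toy distributions (hypothetical): "un" and "ing" stand in for subword units.
hybrid_lm = {"the": 0.5, "cat": 0.3, "un": 0.2}
word_lm = {"the": 0.6, "cat": 0.4}
subword_lm = {"un": 0.7, "ing": 0.3}

# Hypothetical interpolation weights.
mixed = interpolate([hybrid_lm, word_lm, subword_lm], [0.5, 0.3, 0.2])
```

Because each input distribution sums to one and the weights sum to one, the mixed distribution also sums to one, and it covers both whole words and subword units.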
