Abstract

Correcting historical corpora in digital version is a crucial task for the historical research, however, scan quality, book layout, visual character similarity can affect the quality of the recognizing. OCR is at the forefront of digitization projects for cultural heritage preservation. The main task is to identify characters from their visual form into their textual representation. In this paper, we propose a model combining recurrent neutral network(RNN) and deep convolutional network(DCNN) to correct OCR transcription errors. The experiment on a historical book corpus in German language shows that the model is very robust in capturing diverse OCR transcription errors greatly.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call