Text alignment in early printed books combining deep learning and dynamic programming

Zahra Ziran, Xavier Pic, Simone Undri Innocenti, Daniele Mugnai, Simone Marinai

doi:10.1016/j.patrec.2020.02.016

Abstract

We describe a technique for transcript alignment in early printed books that combines deep models with dynamic programming algorithms. Two object detection models, based on Faster R-CNN, are trained to locate words. We first train an initial model to recognize generic words and hyphens by using information about the number of words in text lines. Using this model's predictions on pages where a line-by-line ground-truth annotation is available, we then train a second model able to detect landmark words. The alignment is based on the identification of landmark words in pages where we only know the text corresponding to zones in the page. The proposed technique is evaluated on a publicly available digitization of the Gutenberg Bible, while the transcription is based on the Vulgata, a late 4th-century Latin translation of the Bible.
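As a rough illustration of how dynamic programming can anchor detected landmark words to a known transcript, the sketch below aligns two word sequences with a longest-common-subsequence table. It is not the authors' algorithm: the function name `align_landmarks`, the plain string labels, and the use of exact matching are assumptions made for this example, and a real detector would also supply bounding boxes and confidence scores.

```python
def align_landmarks(transcript, detections):
    """Pair transcript positions with detected landmark labels.

    Hypothetical helper: `transcript` is the known word sequence of a zone,
    `detections` is the sequence of landmark labels in reading order.
    Returns a list of (transcript_index, detection_index) pairs.
    """
    n, m = len(transcript), len(detections)
    # lcs[i][j] = length of the longest common subsequence of
    # transcript[i:] and detections[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if transcript[i] == detections[j]:
                lcs[i][j] = 1 + lcs[i + 1][j + 1]
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    # Walk the table from the start to recover one optimal pairing.
    pairs, i, j = [], 0, 0
    while i < n and j < m:
        if transcript[i] == detections[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1  # transcript word has no matching detection
        else:
            j += 1  # detection is spurious or out of place
    return pairs


transcript = ["in", "principio", "creavit", "deus", "caelum", "et", "terram"]
detections = ["in", "deus", "et", "et"]  # last "et" is a spurious detection
print(align_landmarks(transcript, detections))
```

In this toy case the spurious final detection is left unpaired, while the three consistent landmarks fix positions in the transcript between which the remaining words could be distributed.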

Full Text