Abstract

As the number of digitized historical documents has increased rapidly during the last few decades, it is necessary to provide efficient methods of information retrieval and knowledge extraction to make the data accessible. Such methods depend on optical character recognition (OCR), which converts document images into textual representations. Current OCR methods are often not adapted to the historical domain; moreover, they usually require a significant amount of annotated documents. Therefore, this paper introduces a set of methods that allows performing OCR on historical document images using only a small amount of real, manually annotated training data. The presented complete OCR system includes two main tasks: page layout analysis (including text block and line segmentation) and OCR. Our segmentation methods are based on fully convolutional networks, and the OCR approach utilizes recurrent neural networks. Both approaches are state of the art in the relevant fields. We have created a novel real dataset for OCR from the Porta fontium portal. This corpus is freely available for research, and all proposed methods are evaluated on these data. We show that both the segmentation and OCR tasks are feasible with only a few annotated real data samples. The experiments aim at determining the best way to achieve good performance with the given small set of data. We also demonstrate that the obtained scores are comparable to or even better than the scores of several state-of-the-art systems. To sum up, this paper shows how to create an efficient OCR system for historical documents that needs only a little annotated training data.

Highlights

  • Digitization of historical documents is an important task for preserving our cultural heritage

  • Our segmentation methods are based on fully convolutional networks, and the optical character recognition (OCR) approach utilizes recurrent neural networks

  • We first train the models on a subset of the Europeana newspaper dataset and then fine-tune them on the training set of the Porta fontium dataset


Introduction

Digitization of historical documents is an important task for preserving our cultural heritage. During the last few decades, the amount of digitized archival material has increased rapidly. Therefore, this paper introduces a set of methods to convert historical scans into their textual representation for efficient information retrieval based on a minimal number of manually annotated documents. This problem includes two main tasks: page layout analysis (including text block and line segmentation) and optical character recognition (OCR). One goal of this project is to enable intelligent full-text access to the printed historical documents from the Czech–Bavarian border region. Our original data sources are scanned texts from German historical newspapers printed in Fraktur from the second half of the nineteenth century.
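For the layout-analysis task, a fully convolutional network typically produces a pixel-wise mask of text-line regions, which a post-processing step then groups into individual lines. The paper does not detail its post-processing, but a common approach is connected-component extraction over the binary mask; the sketch below, with an illustrative toy mask and 4-connectivity, shows that idea in pure Python.

```python
# Extract text-line candidates as connected components of a binary mask,
# e.g. the foreground predicted by a segmentation network (toy mask here).
from collections import deque

def connected_components(mask):
    """Return a list of components; each is a list of (row, col) pixels."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and not seen[r][c]:
                # Breadth-first flood fill over 4-connected foreground pixels.
                queue, comp = deque([(r, c)]), []
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                components.append(comp)
    return components

# Toy mask with two separate "text line" regions.
mask = [
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
]
boxes = [(min(y for y, _ in c), min(x for _, x in c),
          max(y for y, _ in c), max(x for _, x in c))
         for c in connected_components(mask)]
print(boxes)  # one bounding box per detected line region
```

Each component's bounding box can then be used to crop a line image that is passed on to the recurrent OCR model.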

