Abstract

Existing text recognition engines make it possible to train general models that recognize not just one specific hand but a multitude of historical hands within a particular script, across a rather large time period (more than 100 years). This paper compares different text recognition engines and their performance on a test set independent of the training and validation sets. We argue that both the test set and the ground truth should be made available by researchers as part of a shared task to allow for the comparison of engines. This will give insight into the range of possible options for institutions in need of recognition models. As a test set, we provide a data set consisting of 2,426 lines randomly selected from meeting minutes of the Swiss Federal Council from 1848 to 1903. To our knowledge, neither the aforementioned text lines, which we take as ground truth, nor the multitude of different hands within this corpus have ever been used to train handwritten text recognition models. In addition, the data set is well suited to comparisons involving recognition engines and large training sets because of its variability and the time frame it spans. Consequently, this paper argues that both tested engines, HTR+ and PyLaia, can handle large training sets. The resulting models yielded very good results on a test set consisting of unknown but stylistically similar hands.

Highlights

  • Since the early 1990s, recognition of printed text has been based on engines for optical character recognition (OCR) (Rice et al., 1993)

  • The output of handwritten text recognition models is usually suitable for further processing if the Character Error Rate (CER) is below 10%

  • It is necessary to assess the quality of the models not solely on validation sets, which contain hands already seen in the training set, but also on a test set consisting of similar hands of the same era written in the same script type
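
The Character Error Rate mentioned in the highlights can be illustrated with a minimal sketch. It assumes the common definition of CER as the character-level edit distance (insertions, deletions, substitutions) between recognized text and ground truth, summed over all lines and divided by the total number of ground-truth characters. The function names `levenshtein` and `cer` are hypothetical, and the normalization actually used by the engines compared in the paper may differ.

```python
from typing import Sequence


def levenshtein(ref: str, hyp: str) -> int:
    """Character-level edit distance between ref and hyp."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level CER: total edits divided by total ground-truth characters.

    Assumption: lines are paired by position and weighted by length.
    """
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("ground truth contains no characters")
    total_edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars
```

Under this definition, a model would meet the 10% threshold on a test set when `cer(ground_truth_lines, recognized_lines) < 0.10`, where both arguments are lists of text lines in the same order.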


Summary

THE QUEST FOR TEXT RECOGNITION

Since the early 1990s, recognition of printed text has been based on engines for optical character recognition (OCR) (Rice et al., 1993). This paper focuses on general models, that is, recognition models capable of recognizing not just one specific hand but similar scripts from different hands that the model has not previously seen. Building such models is one of the remaining problems in handwritten text recognition, and it requires bringing together large amounts of ground-truthed data (transcribed text aligned with images). It is necessary to assess the quality of the models not solely on validation sets, which contain hands already seen in the training set, but also on a test set consisting of similar hands of the same era written in the same script type. For this purpose, we propose a test set for German Kurrent scripts of the 19th century, because enough training material is already available to test the capabilities of text recognition engines for this script. Through cooperation of large communities of stakeholders, including scholars, scientists, and the interested public, we will be able to make the handwritten material of the world better and more easily accessible.

RECOGNIZING HANDWRITING: A RESOLVED TASK
REAL-WORLD SITUATION
TOWARDS GENERAL MODELS OF RECOGNITION
SPECIFIC MODELS
ENGINES AND TESTING THE ASSUMPTION OF LARGE GROUND TRUTH COLLECTIONS
GENERALIZING MODELS
CREATION OF SPECIFIC TEST SETS
ASSESSING MODELS BASED ON THE TRAINING SET
PUBLICATION OF GROUND TRUTH
Findings
GROUND TRUTH AND TEST SET CREATION AS A SHARED TASK