Abstract

We present an efficient and effective approach to training OCR engines using the Aletheia document analysis system. All components required for training are seamlessly integrated into Aletheia: training data preparation, the OCR engine’s training processes themselves, text recognition, and quantitative evaluation of the trained engine. Such a comprehensive training and evaluation system, guided through a GUI, allows for iterative incremental training to achieve the best results. The widely used Tesseract OCR engine serves as a case study to demonstrate the efficiency and effectiveness of the proposed approach. Experimental results are presented validating the training approach with two different historical datasets, representative of recent significant digitisation projects. The impact of different training strategies and training data requirements is presented in detail.

Highlights

  • Document digitisation is an everyday continuing activity at all scales, ranging from the very large content holding institutions to medium-sized operations to individuals undertaking small projects

  • In addition to the detailed description of the proposed OCR engine training system, this paper reports on a number of experiments carried out on different datasets to investigate the ideal training conditions in terms of size and quality of a training set

  • The effectiveness of the training process was investigated and validated using two very different datasets, each representing a realistic use scenario where OCR engine training can make a difference: a sample of the 1961 Census for England and Wales, and a historical book from the Bibliothèque nationale de France (French National Library) dated 1603, collected and ground-truthed for the IMPACT project [28] and representing an example of typical historical fonts

Summary

Introduction

Document digitisation is an everyday continuing activity at all scales, ranging from the very large content holding institutions (e.g. libraries, archives) to medium-sized operations (e.g. charities, community enterprises) to individuals undertaking small projects. Some systems allow adjustments via recognition parameters, but this typically has no major impact on results. In such cases, training OCR engines becomes important in order to recognise those rarer or historic fonts and languages. Previous large-scale research projects related to mass digitisation [5] have demonstrated the potential gain in accuracy when training OCR engines on the material which is to be processed. Such gains can be achieved even for systems which were designed to follow an omni-font approach, i.e. not solely relying on comparison of fixed shapes and patterns but employing more flexible features. The report states improvements from 45% to 80% in character accuracy rate and from 15% to 55% in word accuracy rate for Gothic documents, which are typically very challenging to recognise, after training ABBYY FineReader on only very few pages.
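
To make the character and word accuracy rates quoted above concrete, the following is a minimal sketch of how such figures are commonly computed. It assumes the usual edit-distance definition (one minus the Levenshtein distance divided by the ground-truth length); the summary does not state the exact formula used in the cited report or in Aletheia's evaluation, and the function names here are illustrative only.

```python
def edit_distance(reference, candidate):
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(candidate) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, cand_item in enumerate(candidate, start=1):
            substitution = 0 if ref_item == cand_item else 1
            current.append(min(
                previous[j] + 1,                 # deletion
                current[j - 1] + 1,              # insertion
                previous[j - 1] + substitution,  # match or substitution
            ))
        previous = current
    return previous[-1]


def accuracy(reference, candidate):
    """Assumed definition: 1 - errors / reference length, clamped at 0."""
    if len(reference) == 0:
        return 1.0 if len(candidate) == 0 else 0.0
    errors = edit_distance(reference, candidate)
    return max(0.0, 1.0 - errors / len(reference))


def character_accuracy(ground_truth, ocr_text):
    """Accuracy measured over individual characters."""
    return accuracy(ground_truth, ocr_text)


def word_accuracy(ground_truth, ocr_text):
    """Accuracy measured over whitespace-separated words."""
    return accuracy(ground_truth.split(), ocr_text.split())


if __name__ == "__main__":
    truth = "the quick brown fox"
    result = "the quick brown fax"
    print(f"character accuracy: {character_accuracy(truth, result):.3f}")
    print(f"word accuracy:      {word_accuracy(truth, result):.3f}")
```

Because a single wrong character invalidates the whole word, word accuracy under this definition is usually noticeably lower than character accuracy for the same text, which is consistent with the gap between the two ranges reported above.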
