Abstract

Digital libraries offer access to a large number of handwritten historical documents. These documents are available only as raw images, so their content is not searchable. A fully manual transcription is time-consuming and expensive, while a fully automatic transcription is cheaper but less accurate. The performance of automatic transcription systems depends strongly on the composition of the training set. We propose a multi-step procedure that combines a keyword spotting system with human validation to build a training set in less time than a fully manual procedure requires. The multi-step procedure was tested on a data set of 50 pages extracted from the Bentham collection. The palaeographer who transcribed the data set with the multi-step procedure, rather than fully manually, achieved a time gain of 52.54%. Moreover, the multi-step procedure produced a small training set, with which the keyword spotting system reached a precision greater than its recall, in 35.25% of the time required to annotate the whole data set.
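
The two kinds of figures quoted above can be made concrete with a minimal sketch. It assumes that "time gain" means the fraction of the fully manual transcription time that is saved, which the abstract does not define explicitly, and it uses the standard definitions of precision and recall. The function names and all input numbers are hypothetical and are not taken from the experiments.

```python
from typing import Tuple


def time_gain(manual_time: float, assisted_time: float) -> float:
    """Percentage of time saved with respect to a fully manual transcription.

    Assumed definition: 100 * (manual - assisted) / manual.
    """
    if manual_time <= 0:
        raise ValueError("manual_time must be positive")
    return 100.0 * (manual_time - assisted_time) / manual_time


def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> Tuple[float, float]:
    """Standard precision and recall computed from retrieval counts."""
    retrieved = true_pos + false_pos   # word images returned by the KWS system
    relevant = true_pos + false_neg    # word images that should have been returned
    precision = true_pos / retrieved if retrieved else 0.0
    recall = true_pos / relevant if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical timings in minutes, chosen only for illustration.
    print(f"time gain: {time_gain(200.0, 100.0):.2f}%")

    # Hypothetical counts: few false alarms, many missed words,
    # which yields a precision greater than the recall.
    p, r = precision_recall(true_pos=8, false_pos=2, false_neg=8)
    print(f"precision={p:.2f} recall={r:.2f}")
```

A precision greater than the recall means that most of the words the system retrieves are correct, even though it misses part of the relevant ones. That is a plausible property to favour when a human has to validate every retrieved word, though the abstract does not state this motivation.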

Highlights

  • In the last decade, significant investments have been made in the digital transformation of cultural heritage material

  • The system has been designed to pursue two goals: the first is to reduce the human time and effort needed to build a training set (TS) usable by any handwritten text recognition (HTR) or keyword spotting (KWS) system; the second is to build a small training set, from here on called the reference set (RS), used by the KWS system we adopted for the assisted transcription of the data set (DS)

  • The experiments evaluate how well the multi-step procedure builds a training set for a KWS system used for document transcription

Summary

Introduction

Significant investments have been made in the digital transformation of cultural heritage material. Online digital libraries store and share a huge number of historical books and manuscripts that were scanned to ensure their preservation over the centuries. These digital collections are not searchable because their documents are digital images. The images need to be transcribed before the digital libraries can be indexed and queried. A fully manual transcription is not a viable solution because it is a time-consuming and expensive process. The large number of manuscripts to be digitized, together with the difficulty of reading documents whose lexicon differs from the one in use today, requires the involvement of highly qualified experts in the transcription process.

