Abstract

The current state of the art for automatic transcription of historical manuscripts is typically limited by the requirement of human-annotated learning samples, which are necessary to train specific machine learning models for specific languages and scripts. Transcription alignment is a simpler task that aims to find a correspondence between text in the scanned image and its existing Unicode counterpart, a correspondence which can then be used as training data. The alignment task can be approached with heuristic methods dedicated to certain types of manuscripts, or with weakly trained systems reducing the required amount of annotations. In this article, we propose a novel learning-based alignment method based on fully convolutional object detection that does not require any human annotation at all. Instead, the object detection system is initially trained on synthetic printed pages using a font and then adapted to the real manuscripts by means of self-training. On a dataset of historical Vietnamese handwriting, we demonstrate the feasibility of annotation-free alignment as well as the positive impact of self-training on the character detection accuracy, reaching a detection accuracy of 96.4% with a YOLOv5m model without using any human annotation.
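
The synthetic pre-training step described above can be illustrated with a short script. The sketch below is only one possible realization, not the authors' implementation: it uses Pillow to render characters from an existing Unicode transcription onto a blank page with a TrueType font and records the resulting character boxes in YOLO format. The font file name, page size, and right-to-left columnar layout are assumptions made for illustration.

```python
# Minimal sketch: render a synthetic "printed" page from a Unicode transcription
# and emit YOLO-format character boxes for pre-training an object detector.
# Font path, page size, and column layout are illustrative assumptions.
from PIL import Image, ImageDraw, ImageFont

def render_page(text, font_path="NomNaTong-Regular.ttf", page=(1024, 1448),
                font_size=48, margin=64, col_gap=24):
    """Return the rendered page image and YOLO-style labels
    (class, x_center, y_center, width, height), normalized to page size."""
    img = Image.new("RGB", page, "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, font_size)
    labels = []
    x = page[0] - margin - font_size            # right-most column first
    y = margin
    for ch in text:
        if ch == "\n":                          # explicit column break
            x, y = x - font_size - col_gap, margin
            continue
        if y + font_size > page[1] - margin:    # column full, start the next one
            x, y = x - font_size - col_gap, margin
        draw.text((x, y), ch, font=font, fill="black")
        l, t, r, b = draw.textbbox((x, y), ch, font=font)
        labels.append((0,                       # single generic "character" class
                       (l + r) / 2 / page[0], (t + b) / 2 / page[1],
                       (r - l) / page[0], (b - t) / page[1]))
        y += font_size
    return img, labels

if __name__ == "__main__":
    img, labels = render_page("天地玄黃宇宙洪荒\n日月盈昃辰宿列張")
    img.save("synthetic_page.png")
    with open("synthetic_page.txt", "w") as f:
        for row in labels:
            f.write("%d %.6f %.6f %.6f %.6f\n" % row)
```

Pages generated this way already carry exact bounding boxes, so the initial detector can be trained without any manual annotation.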

Highlights

  • The experimental results clearly demonstrate the high potential of object detection methods for the task of transcription alignment

  • For the historical Vietnamese dataset, such an alignment could be established without asking humans to manually annotate training images, a time-consuming and tedious task that hampers progress in digitizing large quantities of historical manuscripts

  • The proposed method is still learning-based and able to adapt to a variety of page layouts, page backgrounds, and handwriting styles by means of self-training, as sketched below
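
One way to realize the self-training adaptation is classic pseudo-labeling. The sketch below is a generic illustration, not the authors' exact procedure or the YOLOv5 training API: a detector pre-trained on synthetic pages is run over the real manuscript images, its confident detections are kept as pseudo-labels, and the detector is retrained on the enlarged set. The `detector.fit` and `detector.predict` methods and the confidence threshold are hypothetical placeholders.

```python
# Generic pseudo-labeling self-training loop (hypothetical detector interface).
from dataclasses import dataclass

@dataclass
class Detection:
    x: float; y: float; w: float; h: float   # normalized box center and size
    score: float                              # detector confidence

def pseudo_label(detector, real_pages, conf_thresh=0.8):
    """Keep only confident boxes on unlabeled pages as training targets
    (class 0 = generic character)."""
    labels = {}
    for page in real_pages:
        boxes = [d for d in detector.predict(page) if d.score >= conf_thresh]
        labels[page] = [(0, d.x, d.y, d.w, d.h) for d in boxes]
    return labels

def self_train(detector, synthetic_set, real_pages, rounds=3):
    """Pre-train on synthetic pages, then alternate pseudo-labeling and retraining."""
    detector.fit(synthetic_set)                       # 1) pre-train on synthetic data
    for _ in range(rounds):
        pseudo = pseudo_label(detector, real_pages)   # 2) pseudo-label real pages
        detector.fit({**synthetic_set, **pseudo})     # 3) retrain on the union
    return detector
```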


Summary

Introduction

To preserve and access our cultural heritage, the digitization of historical manuscripts is an important goal for libraries all around the world. Scanning the documents and indexing them with meta-information about author, date, place, and so on is only a first step. Afterwards, automatic document analysis and recognition is needed to extract texts, illustrations, signatures, stamps, etc. from the scanned page images and make them amenable to searching, browsing, indexing, and linking, similar to websites on the Internet. The majority of current methods are based on machine learning and have a fundamental limitation: the need for human-annotated learning samples to train the document analysis systems. Considering the wide variety of historical documents, it is often necessary to manually annotate dozens or hundreds of pages, only to transcribe a few thousand similar pages afterwards with an automatic system.
