Abstract

This paper presents a digital image dataset of historical handwritten birth records stored in the archives of several parishes across Sweden, together with the corresponding metadata that supports the evaluation of document analysis algorithms’ performance. The dataset is called SHIBR (the Swedish Historical Birth Records). The contribution of this paper is twofold. First, we believe it is the first and the largest Swedish dataset of its kind provided as open access (15,000 high-resolution colour images from the period between 1800 and 1840). We also perform some data mining of the dataset to uncover statistics and facts that might be of interest and use to genealogists. Second, we provide a comprehensive survey of contemporary datasets in the field that are open to the public, along with a compact review of word spotting techniques. The word transcription file contains 17 columns of information pertaining to each image (e.g., child’s first name, birth date, date of baptism, father’s first/last name, mother’s first/last name, death records, town, job title of the father/mother). Moreover, we evaluate some deep learning models, pre-trained on two other renowned datasets, for word spotting in SHIBR. However, our dataset proved challenging due to the unique handwriting style. Therefore, the dataset could also be used for competitions dedicated to a large set of document analysis problems, including word spotting.

Highlights

  • Digitising the past is a way to preserve history, restore deteriorating/uncompleted text, extract facts and information, and help in searching, document retrieval and data mining tasks

  • This paper presents a digital image dataset of historical handwritten birth records stored in the archives of several parishes across Sweden, together with the corresponding metadata that supports the evaluation of document analysis algorithms’ performance

  • As a baseline on the SHIBRp test set, we compute the mAP with respect to page retrieval (mAPpage; Sect. 5.1) using the Ctrl-F-Mini models trained on the George Washington Dataset [17] and on the IAM Offline Handwriting Dataset [23]
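
For readers unfamiliar with the metric named in the last highlight, the following is a minimal sketch of mean average precision (mAP) computed over page-level retrieval results. It uses the standard textbook definition of average precision; the exact mAPpage protocol of Sect. 5.1 is not reproduced in this summary and may differ. The function names, the query words, and the page identifiers are all hypothetical, and each ranked list is assumed to contain no repeated pages.

```python
from typing import Dict, List, Set


def average_precision(ranked_pages: List[str], relevant_pages: Set[str]) -> float:
    """Average precision for one query.

    Averages precision@k over every rank k at which a relevant page appears,
    normalised by the total number of relevant pages for the query.
    Assumes ranked_pages contains no duplicates.
    """
    if not relevant_pages:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, page in enumerate(ranked_pages, start=1):
        if page in relevant_pages:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_pages)


def mean_average_precision(
    rankings: Dict[str, List[str]], ground_truth: Dict[str, Set[str]]
) -> float:
    """Mean of the per-query average precision over all queries in rankings."""
    aps = [
        average_precision(pages, ground_truth.get(query, set()))
        for query, pages in rankings.items()
    ]
    return sum(aps) / len(aps) if aps else 0.0


# Hypothetical toy data: two query words, three pages.
rankings = {
    "Anna": ["p3", "p1", "p2"],  # pages ordered by descending retrieval score
    "Olof": ["p2", "p3", "p1"],
}
ground_truth = {
    "Anna": {"p1", "p3"},  # pages that actually contain the query word
    "Olof": {"p1"},
}

map_page = mean_average_precision(rankings, ground_truth)
print(f"mAP over pages: {map_page:.4f}")
```

In the toy data, the first query ranks both relevant pages at the top (average precision 1), while the second finds its only relevant page at rank 3 (average precision 1/3), so the mean is 2/3.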

Summary

Introduction

Digitising the past is a way to preserve history, restore deteriorating/uncompleted text, extract facts and information, and help in searching, document retrieval and data mining tasks. Among the “Endangered Archives Programme” initiatives of the British Library is the digitisation of manuscripts of the Al-Aqsa Mosque Library, East Jerusalem [3]. This historical collection contains more than a hundred Arabic language titles that span several Islamic periods, from the ninth century CE to the end of the Ottoman rule in Palestine at the beginning of the twentieth century. An old and still valid way to transcribe historical handwritten documents is to rely on crowdsourcing. It is the practice of gathering information or input into a task by acquiring the services of a large number of people (a.k.a. the crowd). We conclude this section by noting that this dataset would enrich the availability of historical handwritten document datasets and help develop more accurate algorithms for word spotting, optical character recognition (OCR), document layout analysis and image binarization. It would serve the research community interested in history and heritage (i.e., genealogists), see Fig. 1. […] (ARDIS) [6], both of which are generously provided for free by Arkiv Digital AD AB, a Swedish company.

Review of related public datasets
Limitations of existing document images databases
Challenges and opportunities in historical handwritten documents
Opportunities: window into the past
SHIBR dataset
Structure of SHIBR
Mining SHIBRm – Statistical insights
Experiments and results
Segmentation-free evaluation of word spotting
Results
Conclusion