Abstract

Historical document analysis systems are gaining importance as efforts to digitize archives increase. Page segmentation and layout analysis are crucial steps in such systems: errors in these steps propagate to handwritten text recognition and Optical Character Recognition (OCR) methods and degrade their output. Degradation of documents, digitization errors, and varying layout styles complicate the segmentation of historical documents. Properties of Arabic script such as connected letters, ligatures, diacritics, and different writing styles make historical documents in Arabic script even more challenging to process. In this study, we developed an automatic system that counts registered individuals and assigns them to populated places using a CNN-based architecture. To evaluate the performance of our system, we created a labeled dataset of registers obtained from the first wave of population registers of the Ottoman Empire, held between the 1840s and 1860s. We achieved promising results in classifying different types of objects, counting individuals, and assigning them to populated places.

Highlights

  • Historical documents are valuable cultural resources that enable the examination of the historical, social, and economic aspects of the past

  • We developed an automatic individual counting system for the registers recorded in the first censuses of the Ottoman Empire, which were held between 1840 and 1860

  • The registers were written in Arabic script, and their layouts depended heavily on the district and the officer in charge

Summary

Introduction

Historical documents are valuable cultural resources that enable the examination of the historical, social, and economic aspects of the past. Their digitization gives researchers and the public immediate access to these archives. When digitizing historical documents, segmenting each page into its different areas is a critical step for further document analysis [1]. Example applications of historical document processing include historical weather analysis [2], personnel record analysis [3], and digitization of music score images through optical music recognition (OMR) [4]. Page segmentation techniques analyze the document by dividing the image into different regions such as backgrounds, texts, graphics, and decorations [5]. Historical documents are difficult to segment with projection-based or rule-based methods [5].
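To illustrate what a projection-based method does and why it can be fragile on degraded pages, the following is a minimal sketch of a horizontal projection profile. It is not the method used in this study; the function names, the `min_ink` threshold, and the toy pages are hypothetical and chosen only for illustration. The page is assumed to be already binarized, with 1 for ink and 0 for background.

```python
def horizontal_projection(binary_page):
    """Count ink pixels (value 1) in each row of a binarized page."""
    return [sum(row) for row in binary_page]


def split_rows_by_profile(binary_page, min_ink=1):
    """Return half-open (start, end) row ranges whose ink count is >= min_ink.

    Rows start .. end - 1 are treated as one band of text.
    """
    profile = horizontal_projection(binary_page)
    bands = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y  # a band of ink begins on this row
        elif ink < min_ink and start is not None:
            bands.append((start, y))  # the band ended on the previous row
            start = None
    if start is not None:
        bands.append((start, len(profile)))  # band runs to the page bottom
    return bands


# Hypothetical toy page: two text lines separated by one blank row.
clean_page = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
]

# Same page with one stray ink pixel (for example a stain or a long
# stroke) in the row that used to be blank.
noisy_page = [row[:] for row in clean_page]
noisy_page[2][3] = 1

print(split_rows_by_profile(clean_page))
print(split_rows_by_profile(noisy_page))
```

On the clean toy page the profile finds two bands, while a single stray pixel in the gap merges them into one. This is a simplified view of how stains, touching strokes, or irregular layouts can defeat projection profiles and fixed rules.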
