Abstract

Recently, a growing number of studies have applied deep learning algorithms to extract information from handwritten historical documents. To do so, documents must first be divided into smaller parts. Page and line segmentation are vital stages in Handwritten Text Recognition systems: they directly affect the character segmentation stage, which in turn determines recognition success. In this study, we first applied deep learning-based layout analysis techniques to detect individuals in the first Ottoman population register series, collected between the 1840s and the 1860s. We then applied horizontal projection profile-based line segmentation to the demographic information of the detected individuals. We further trained a CNN model to recognize the automatically detected ages of individuals and estimated the age distribution of the population from these historical documents. Extracting age information from these registers is significant because it has enormous potential to revolutionize the historical demography of the roughly 20 present-day successor states of the Ottoman Empire. We achieved approximately 60% digit accuracy in recognizing the numbers in these registers and estimated the age distribution with a Root Mean Square Error of 23.61.
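The horizontal projection profile step mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `segment_lines`, the parameters `min_ink` and `min_height`, and the toy page are all assumptions made for illustration. The sketch assumes the region has already been binarized so that ink pixels are nonzero; it counts ink pixels per row and treats each run of consecutive non-empty rows as one text line.

```python
def segment_lines(binary, min_ink=1, min_height=2):
    """Split a binarized region into text lines (illustrative sketch).

    binary     : list of rows, each a list of pixels; nonzero means ink.
    min_ink    : hypothetical threshold, ink pixels needed for a row to count as text.
    min_height : hypothetical threshold, runs shorter than this are treated as noise.
    Returns a list of (top, bottom) row indices, with bottom exclusive.
    """
    # Horizontal projection profile: number of ink pixels in each row.
    profile = [sum(1 for px in row if px) for row in binary]

    lines = []
    start = None  # first row of the run currently being tracked
    for row_idx, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = row_idx                      # a text run begins
        elif ink < min_ink and start is not None:
            if row_idx - start >= min_height:    # keep only tall-enough runs
                lines.append((start, row_idx))
            start = None                         # the run has ended
    # Close a run that reaches the bottom edge of the region.
    if start is not None and len(profile) - start >= min_height:
        lines.append((start, len(profile)))
    return lines


# Toy page: two 3-row "text lines" and a single-pixel speck of noise.
page = [[0] * 20 for _ in range(12)]
for r in range(1, 4):
    for c in range(2, 18):
        page[r][c] = 1
for r in range(6, 9):
    for c in range(3, 15):
        page[r][c] = 1
page[10][5] = 1  # isolated speck, removed by min_height

print(segment_lines(page))  # [(1, 4), (6, 9)]
```

On real register scans the profile is usually noisier than in this toy case, so thresholds such as `min_ink` and `min_height` would likely need tuning, and skewed or touching lines may require additional handling that this sketch does not attempt.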

Highlights

  • We have been living in written cultures for ages; we produce vast amounts of documentation, and we are governed and ruled by it

  • We aimed to implement a method for recognizing the text of similar registers compiled in different regions of the Ottoman Empire between the 1840s and the 1860s

  • We achieved approximately 60% digit detection accuracy


Summary

Introduction

We have been living in written cultures for ages; we produce vast amounts of documentation, and we are governed and ruled by it. In the past, the information and correspondence kept in manuscripts had to be processed manually, because there were no comprehensive, high-quality digitized datasets to which an automatic method could be applied. Because high-quality digital scanning solutions and high-capacity storage devices were rare, converting manuscripts from paper to digital form was difficult. This task has become far more feasible thanks to dramatic progress in digital scanning and storage solutions. As a result, national libraries and archives around the world now hold many digitized historical documents in Arabic script. Processing historical Arabic documents remains a difficult research problem, owing to the complex nature of Arabic script compared to other scripts and to the fragility of ancient documents, which are subject to degradation [4].

