Recognizing the orthography changes for identifying the temporal origin on the example of the Balkan historical documents

Darko Brodić, Alessia Amelio

doi:10.1007/s00521-017-3292-1

Abstract

This paper introduces a new approach for automatically identifying the temporal origin of digitized historical documents stored as images, using documents from the Balkan region as an example. The approach rests on the idea that differences in orthographic style are determined by the evolution of scripts or languages over time. It begins with a script-coding phase, which maps the letters of the document into a sequence of numerical codes. Each code is associated with a gray level in the image space, so the sequence of numerical codes can be transformed into an image. Texture analysis is then applied to the resulting image to extract the document's features. Finally, the feature vector of the document is classified to recognize its orthography style. An experiment is performed on two databases and on a test collection of historical documents extracted from digitized books in the Slavonic–Serbian and Serbian languages, written in Cyrillic script, and in the Croatian recension of the Old Church Slavonic language, written in angular Glagolitic script. The results show the efficacy of the proposed approach, its robustness to 'noisy' documents, and its superiority over other approaches in the literature that use language or script discrimination for orthography recognition.
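The pipeline described in the abstract (script coding, gray-level mapping, texture analysis) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the coding table, the number of codes, or the texture descriptors, so the four-code rule in `code_letter`, the one-dimensional co-occurrence matrix, and the two features (energy and contrast) are all assumptions chosen for illustration. The classification step is omitted.

```python
from collections import Counter

LEVELS = 4  # assumption: the number of distinct codes is not given in the abstract


def code_letter(ch):
    """Map a letter to a numerical code, or None for non-letters.

    Hypothetical rule: the real coding table is not specified in the abstract.
    """
    if not ch.isalpha():
        return None
    return ord(ch.lower()) % LEVELS


def encode(text):
    """Script coding: turn the letters of a text into a sequence of codes."""
    return [c for c in map(code_letter, text) if c is not None]


def to_gray(codes, levels=LEVELS):
    """Associate each code with a gray level spread evenly over 0..255."""
    step = 255 // (levels - 1)
    return [c * step for c in codes]


def cooccurrence(codes, levels=LEVELS):
    """Normalized co-occurrence matrix of adjacent codes (1-D, distance 1)."""
    counts = Counter(zip(codes, codes[1:]))
    total = max(len(codes) - 1, 1)
    return [[counts[(i, j)] / total for j in range(levels)] for i in range(levels)]


def texture_features(matrix):
    """Two common co-occurrence descriptors, used here as an example feature vector."""
    energy = sum(p * p for row in matrix for p in row)
    contrast = sum(
        (i - j) ** 2 * p for i, row in enumerate(matrix) for j, p in enumerate(row)
    )
    return [energy, contrast]


if __name__ == "__main__":
    codes = encode("Пример текста")  # any Cyrillic or Latin sample works
    print(to_gray(codes))
    print(texture_features(cooccurrence(codes)))
```

The feature vector returned by `texture_features` is what a classifier would consume in the final step; any standard classifier could be substituted there, and the abstract does not say which one is used.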

Full Text