Abstract

Optical character recognition (OCR) has improved markedly in recent years, yet conventional OCR strategies do not yet exploit linguistic concepts from Arabic script analysis. We present a complementary strategy that aims to enhance Arabic OCR. In this approach, (A) disambiguating dots are temporarily eliminated, which reduces each class of graphemes sharing the same base element to a single archigrapheme, and (B) the contextual behaviour of Arabic archigraphemes is redefined as fusing: archigraphemes merge unrecognizably into letter blocks according to a rule-based system called script grammar. The letter block is defined as the minimum unit of Arabic script formation. For example, the word بحوث consists of two letter blocks (groups of fused allographs surrounded by graphic space): ٮحو and ٮ (BGW B). From an Arabic corpus of circa 85 million words we extracted a list of circa 47,000 unique archigraphemic letter blocks, which in effect reduces the generative, dynamic Arabic writing system to the proportions of a static script such as Chinese. We then show how to synthesise all theoretical shapes for each letter block from computer models of specific Islamic script styles (ruqʿä, naskh, nastaʿlīq). Only in the final stage would we need to disambiguate the archigraphemes into actual graphemes using linguistic information, part of which we have already gathered from the 85-million-word corpus. This approach also makes initial OCR training possible on texts rendered with the very same Islamic script models.
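The reduction to archigraphemic letter blocks can be illustrated with a minimal sketch. It is not the authors' implementation. The mapping table is a partial, hypothetical one: only the codes B, G and W come from the example above, and the remaining codes are assumptions made for illustration. Letters whose base shape depends on position (such as ن and ي) are deliberately left out. The sketch also assumes that a letter block ends after any letter that does not join to its successor.

```python
# Partial, hypothetical mapping from Arabic letters to archigrapheme codes.
# Only B, G and W are attested in the abstract's example (BGW B);
# every other code is an assumption made for illustration.
ARCHIGRAPHEME = {
    "ب": "B", "ت": "B", "ث": "B",
    "ج": "G", "ح": "G", "خ": "G",
    "د": "D", "ذ": "D",
    "ر": "R", "ز": "R",
    "س": "S", "ش": "S",
    "ا": "A", "و": "W", "ل": "L", "م": "M", "ك": "K",
}

# Archigraphemes that never join to the following letter,
# so they always close the current letter block (assumed rule).
NON_JOINING = {"A", "D", "R", "W"}


def letter_blocks(word):
    """Strip the dots (via the mapping) and split a word into letter blocks."""
    blocks, current = [], ""
    for ch in word:
        current += ARCHIGRAPHEME[ch]  # KeyError for letters outside the sketch
        if current[-1] in NON_JOINING:
            blocks.append(current)
            current = ""
    if current:
        blocks.append(current)
    return blocks


print(letter_blocks("بحوث"))  # ['BGW', 'B']
```

Applied to every word of a corpus, collecting the returned blocks into a set would yield an inventory of unique letter blocks comparable in kind to the list described above.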
