Abstract

Current OCR has limited capability for Arabic because its script models lack a scientific basis. We propose a new OCR strategy for Arabic, based on (1) Islamic script grammar, including extended shaping, and (2) treating Arabic script as a multi-layered writing system. We analyse Arabic script as an allographic rendering of graphemic abstractions. Grapheme is a term adapted from phonology; it is analogous to the term phoneme. In phonology, the smallest functional unit of sound is the phoneme. This is not heard, but perceived. What one hears are contextually conditioned allophones. In Arabic orthography, the smallest functional unit of spelling is the grapheme. This is not seen, but perceived. What one sees are contextually conditioned allographs. In our analysis, the letter block is the minimum unit of Arabic script formation and therefore of script grammar. A letter block is a single allograph or a group of fused allographs surrounded by graphic space. The analogy with phonology can be pushed further: the archiphoneme is the bundle of features shared by two or more phonemes, minus their distinctive features. Likewise, the archigrapheme is the bundle of features shared by two or more graphemes, minus their distinctive features. An archigraphemic letter block consists of one or more reduced allographs between spaces. The letter block follows the base line. There can be ligatures between letter blocks. In our strategy the archigraphemic letter block also forms the minimum unit of OCR. We have (1) implemented an algorithm that reduces any Unicode text in Arabic script to archigraphemes, and used it to create a list, in Unicode format, of all unique archigraphemic letter blocks attested on the internet. (2) With this list, and applying extended Islamic script grammar, we can synthesize realistic images of all possible archigraphemic fusions in a given style.
These two developments make it possible to create an OCR system that recognizes synthetic Arabic under controlled conditions, for both basic and extended shaping in a given style. Together they give the system competence, after which it should be trained to tolerate the variation of performance found in real documents. To interpret the identified letter blocks linguistically, a technique for parsing archigraphemes must be developed. For example, the single sequence of the three archigraphemic letter blocks EBD A LLH can be interpreted as several different surface texts, such as abda-n li llaahi, abdu l-laahi and inda l-laahi. To facilitate the linguistic phase of the process, the same list of unique archigraphemic letter blocks is designed to identify the language of the text under scrutiny. In this phase we can present:
• Islamic script synthesis
• Unicode conversion from plene orthography to archigraphemic transliteration
• the archigraphemic search algorithm
• the list of unique archigraphemic letter blocks
• samples of authentic shape generation
These are the first steps towards static OCR technology. The next step is to create or find matching AI software that teaches the OCR to recognize unmapped letter blocks, making the OCR dynamic.
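
The reduction from plene Unicode text to archigraphemic letter blocks can be illustrated with a minimal Python sketch. It is not the authors' algorithm. It assumes a simplified skeleton table in which only the symbols E, B, D, A, L and H follow the example above; every other symbol (G, R, S, C, T, F, Q, K, M, N, Y, W), the table itself and all function names are hypothetical. It also assumes that any character outside the table, including spaces and the standalone hamza, simply ends the current letter block.

```python
import unicodedata

TATWEEL = "\u0640"  # elongation stroke, carries no letter identity

# Assumed, simplified skeleton classes: letters that differ only in dots or
# hamza share one archigrapheme symbol. Symbols other than E, B, D, A, L, H
# are hypothetical.
SKELETON = {}
for letters, symbol in [
    ("اأإآٱ", "A"), ("بتث", "B"), ("جحخ", "G"), ("دذ", "D"), ("رز", "R"),
    ("سش", "S"), ("صض", "C"), ("طظ", "T"), ("عغ", "E"), ("ف", "F"),
    ("ك", "K"), ("ل", "L"), ("م", "M"), ("هة", "H"), ("وؤ", "W"),
]:
    for ch in letters:
        SKELETON[ch] = symbol

# Letters whose skeleton depends on position inside the letter block:
# (symbol when another letter follows, symbol when last in the block).
POSITIONAL = {
    "ن": ("B", "N"), "ي": ("B", "Y"), "ى": ("B", "Y"),
    "ئ": ("B", "Y"), "ق": ("F", "Q"),
}

# Letters that never connect to a following letter, so they close a block.
NON_JOINING = set("اأإآٱدذرزوؤة")


def split_blocks(text):
    """Split text into letter blocks of raw letters, ignoring word spacing."""
    blocks, current = [], ""
    for ch in text:
        # Drop vowel marks and other combining marks, and the tatweel.
        if unicodedata.category(ch) == "Mn" or ch == TATWEEL:
            continue
        # Anything outside the table (space, digit, hamza) ends the block.
        if ch not in SKELETON and ch not in POSITIONAL:
            if current:
                blocks.append(current)
                current = ""
            continue
        current += ch
        if ch in NON_JOINING:
            blocks.append(current)
            current = ""
    if current:
        blocks.append(current)
    return blocks


def reduce_block(block):
    """Map one letter block to its archigraphemic transliteration."""
    out = []
    for i, ch in enumerate(block):
        if ch in POSITIONAL:
            non_final, final = POSITIONAL[ch]
            out.append(final if i == len(block) - 1 else non_final)
        else:
            out.append(SKELETON[ch])
    return "".join(out)


def archigraphemes(text):
    """Return the archigraphemic letter blocks of text, space-separated."""
    return " ".join(reduce_block(b) for b in split_blocks(text))


if __name__ == "__main__":
    for surface in ["عبدا لله", "عبد الله", "عند الله"]:
        print(surface, "->", archigraphemes(surface))
```

Under these assumptions, all three surface spellings in the usage loop reduce to the same sequence EBD A LLH, which is why a separate linguistic parsing step is needed after recognition. Keying the output on letter blocks rather than on words means that word spacing has no effect on the result.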
