Space Anomalies in OCRs for Arabic Like Scripts

Riaz Ahmad, M Zeshan Afzal, Andreas Dengel, Marcus Liwicki, S Faisal Rashid

doi:10.1109/asar.2018.8480229

Abstract

This paper investigates and analyses the nature of errors occurring in Optical Character Recognition (OCR) for Arabic-like scripts. Existing research on OCR for Arabic-like scripts often focuses on achieving the best performance in terms of character error rate. Little effort has gone into analysing the nature of the errors (anomalies) that occur. One important anomaly is the space anomaly. It arises from breaker characters, which are an essential part of Arabic-like scripts. The spaces introduced by breaker characters are not represented in the ground truth, making it hard for an OCR model to generalize. The model learns either to suppress genuine spaces or to generate extra spaces where they do not belong. Because of this confusion, the rendered output looks sub-optimal. This paper analyses and removes space anomalies. We present a joint approach that not only performs OCR but also handles space anomalies robustly, and thereby significantly outperforms the state of the art. Although the benefit of the work is demonstrated through an improved character recognition rate, its impact is greater in terms of the correctness of the OCR output for practical purposes, especially rendering. The claim is supported by an empirical evaluation, which shows that the proposed approach achieves the best results.

Full Text