Abstract

This paper investigates and analyses the nature of errors occurring in Optical Character Recognition (OCR) for Arabic-like scripts. Existing research on OCR for Arabic-like scripts often focuses on achieving the best performance in terms of character error rate. Little effort has gone into analysing the nature of the errors (anomalies) that occur. One important anomaly is the space anomaly. It arises from breaker characters, which are an essential part of Arabic-like scripts. The spaces introduced by breaker characters are not represented in the ground truth, making it hard for an OCR model to generalize. The model learns either to suppress genuine spaces or to generate extra spaces where they do not belong. Because of this confusion, the rendered output is sub-optimal. This paper analyses and removes space anomalies. We present a joint approach that not only performs OCR but also handles space anomalies robustly, significantly outperforming the state of the art. Although the benefit of the work is demonstrated through an improved character recognition rate, the impact of this research is greater in terms of the correctness of OCR output for practical purposes, especially rendering. The claim is supported by empirical evaluation, in which the proposed approach achieved the best results.
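To make the distinction between space errors and other character errors concrete, the following is a minimal sketch of how the two could be counted separately when comparing an OCR prediction with its ground truth. It is not the paper's method or evaluation protocol: the function name `space_error_counts`, the returned keys, and the use of Python's `difflib` alignment are all assumptions made for illustration, and the sample strings use Latin placeholders rather than real Arabic-script data.

```python
from difflib import SequenceMatcher


def space_error_counts(ground_truth: str, prediction: str) -> dict:
    """Split alignment errors into missing spaces, extra spaces, and others.

    Hypothetical helper for illustration only; it approximates the alignment
    with difflib rather than a true minimum edit distance.
    """
    missing = extra = other = 0
    matcher = SequenceMatcher(None, ground_truth, prediction, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # aligned identical text contributes no errors
        gt_seg = ground_truth[i1:i2]
        pred_seg = prediction[j1:j2]
        gt_spaces = gt_seg.count(" ")
        pred_spaces = pred_seg.count(" ")
        # Spaces present in the ground truth but dropped by the model.
        missing += max(gt_spaces - pred_spaces, 0)
        # Spaces produced by the model that the ground truth does not have.
        extra += max(pred_spaces - gt_spaces, 0)
        # Everything else: substituted, inserted, or deleted non-space characters.
        other += max(len(gt_seg) - gt_spaces, len(pred_seg) - pred_spaces)
    return {"missing_spaces": missing, "extra_spaces": extra, "other_errors": other}


if __name__ == "__main__":
    # Placeholder strings: one space dropped, no other character errors.
    print(space_error_counts("ab cd", "abcd"))
```

Reporting the two space counts alongside the overall character error rate would show whether a model tends to suppress genuine spaces or to insert spurious ones, which a single aggregate error rate hides.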
