Abstract

Analytical approaches in Optical Character Recognition (OCR) systems can suffer from a significant number of segmentation errors, especially for cursive scripts such as Arabic, where characters frequently overlap. Holistic approaches, which treat whole words as single units, were introduced as an effective way to avoid such segmentation errors. Their main challenge, however, is computational complexity, especially in large-vocabulary applications. In this paper, we introduce a computationally efficient holistic Arabic OCR system. A lexicon reduction approach based on clustering similarly shaped words is used to reduce recognition time. By combining global word-level Discrete Cosine Transform (DCT) features with local block-based features, the proposed approach generalizes to font sizes that were not included in the training data. Evaluation results on different test sets from modern and historical Arabic books are promising compared with state-of-the-art Arabic OCR systems.

Highlights

  • Cursive script recognition has traditionally been handled by two major paradigms: a segmentation-based analytical approach and a word-based holistic approach

  • We propose a computationally efficient holistic Arabic Optical Character Recognition (OCR) system for a large vocabulary size

  • Firstly, we find the Centre of Gravity (COG) of the image and take it as the starting point; to calculate the centre of gravity, the horizontal and vertical centres are determined from the raw image moments M(p,q) as x̄ = M(1,0)/M(0,0) and ȳ = M(0,1)/M(0,0)
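The COG computation in the highlight above can be sketched with raw image moments: M(0,0) is the total ink mass, and M(1,0)/M(0,0), M(0,1)/M(0,0) give the horizontal and vertical centres. This is a minimal sketch, not the paper's implementation; the helper name `centre_of_gravity` is hypothetical.

```python
import numpy as np

def centre_of_gravity(img):
    """Centre of gravity of a binary word image via raw image moments:
    x_bar = M(1,0)/M(0,0), y_bar = M(0,1)/M(0,0)."""
    ys, xs = np.nonzero(img)      # coordinates of ink pixels
    m00 = len(xs)                 # M(0,0): total number of ink pixels
    if m00 == 0:
        raise ValueError("empty image has no centre of gravity")
    x_bar = xs.sum() / m00        # M(1,0) / M(0,0)
    y_bar = ys.sum() / m00        # M(0,1) / M(0,0)
    return x_bar, y_bar

# A 5x5 image with a single ink pixel at row 2, column 3:
img = np.zeros((5, 5), dtype=np.uint8)
img[2, 3] = 1
print(centre_of_gravity(img))  # (3.0, 2.0)
```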


Summary

Introduction

Cursive script recognition has traditionally been handled by two major paradigms: a segmentation-based analytical approach and a word-based holistic approach. In a later work, Khorsheed [13] presented a cursive Arabic text recognition system based on HMMs. This system was segmentation-free, with an easy-to-extract statistical feature vector of length 60, representing three different types of features. The system was trained on a multi-font data set sampled randomly with equal sample sizes across fonts, tested on a data set of 200 lines per font, and achieved an accuracy of 95% using the tri-model. In another effort, Krayem et al. [14] presented a word-level recognition system using a discrete hidden Markov classifier along with a block-based discrete cosine transform.
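Block-based DCT features, as used by Krayem et al. [14], can be illustrated as follows: the word image is tiled into fixed-size blocks, each block is transformed with a 2-D DCT, and only the low-frequency coefficients (top-left corner of each block) are kept as features. This is a hedged sketch with hypothetical parameters (`block=8`, `keep=4`), not the authors' exact configuration.

```python
import numpy as np
from scipy.fft import dctn

def block_dct_features(img, block=8, keep=4):
    """Tile the image into block x block patches, apply a 2-D DCT-II to
    each patch, and keep the keep x keep low-frequency coefficients."""
    h, w = img.shape
    feats = []
    for r in range(0, h - block + 1, block):
        for c in range(0, w - block + 1, block):
            tile = img[r:r + block, c:c + block].astype(float)
            coeffs = dctn(tile, norm='ortho')            # 2-D DCT-II
            feats.append(coeffs[:keep, :keep].ravel())   # low-frequency part
    return np.concatenate(feats)

# A 16x16 image yields 4 blocks of 16 coefficients each -> 64 features.
features = block_dct_features(np.ones((16, 16)))
print(features.shape)  # (64,)
```

Keeping only low-frequency coefficients compresses each block while retaining its coarse shape, which is why DCT features are robust to small variations such as font-size changes.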

System Description
Feature Extraction
Lexical Reduction and Clustering
Language Rescoring
Experiment Results
Conclusions and Future Work
