Abstract

Building a robust Optical Character Recognition (OCR) system for languages with cursive scripts, such as Arabic, has always been challenging. These challenges increase when the text contains diacritics of different sizes attached to characters and words. Beyond the complexity of the font used, these challenges must be addressed when recognizing the text of the Holy Quran. To solve them, an OCR system must go through several phases, and each problem must be addressed with a different approach; researchers are therefore studying these challenges and proposing various solutions. This motivated the present study to review Arabic OCR datasets, because the dataset plays a major role in determining the nature of the OCR system. State-of-the-art approaches in segmentation and recognition are surveyed, including the implementation of Recurrent Neural Networks (Long Short-Term Memory, LSTM, and Gated Recurrent Unit, GRU) trained with Connectionist Temporal Classification (CTC), as well as deep learning models and the use of GRUs in the Arabic domain. This paper contributes a profile of Arabic text recognition datasets, thereby determining the nature of the OCR systems developed, and identifies research directions for building Arabic text recognition datasets.

Highlights

  • The input of an Optical Character Recognition (OCR) system is a page image

  • Zayene et al (2018b) presented an Arabic embedded video text recognition system based on a deep learning approach: an MDLSTM network serves as the input layers and learns features directly from the raw input image, and the output layer uses Connectionist Temporal Classification (CTC) with a softmax activation function

  • We discovered that complex tasks, such as recognizing diacritical image texts (Quranic text) at the word or line level, have not received much attention in Arabic OCR


Summary

INTRODUCTION

The input of an Optical Character Recognition (OCR) system is a page image. The AcTiV dataset (Zayene et al 2015) is a public dataset extracted from 80 videos (more than 850,000 frames) collected from 4 different Arabic news channels; it consists of 4,824 text lines with 21,520 words. Zayene et al (2018b) presented an Arabic embedded video text recognition system based on a deep learning approach: they used an MDLSTM network as the input layers, so the MDLSTM learns features directly from the raw input image, and for the output layer they used CTC with a softmax activation function. The proposed method was trained and evaluated on the AcTiV-R database, a part of the AcTiV dataset consisting of 10,415 text-line images and 44,583 words. It was concluded that complex tasks, such as recognizing diacritical image texts (Quranic text) at the word or line level, have not received much attention, and this could lead to future research directions in this area.
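The CTC output layer mentioned above scores a target transcription against per-frame softmax outputs without requiring a frame-level alignment. As an illustration only (not the authors' code), the following NumPy sketch implements the standard CTC forward algorithm: blanks are interleaved into the label sequence, and a dynamic-programming recursion sums the probabilities of all alignments that collapse to the target labels.

```python
import numpy as np

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over a few scalars."""
    xs = [x for x in xs if x != -np.inf]
    if not xs:
        return -np.inf
    m = max(xs)
    return m + np.log(sum(np.exp(x - m) for x in xs))

def ctc_neg_log_likelihood(log_probs, labels, blank=0):
    """CTC loss for one sample via the forward (alpha) recursion.

    log_probs: (T, C) per-frame log-softmax outputs of the recurrent network.
    labels:    target label indices, without blanks.
    Returns -log P(labels | log_probs), summed over all valid alignments.
    """
    T = log_probs.shape[0]
    # Extended sequence with blanks interleaved, e.g. [a, b] -> [-, a, -, b, -]
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S = len(ext)

    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]          # start with a blank...
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]      # ...or the first label
    for t in range(1, T):
        for s in range(S):
            terms = [alpha[t - 1, s]]           # stay on the same symbol
            if s >= 1:
                terms.append(alpha[t - 1, s - 1])  # advance by one
            # Skip over a blank, allowed only between distinct labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[t - 1, s - 2])
            alpha[t, s] = logsumexp(*terms) + log_probs[t, ext[s]]

    # Valid paths end on the last label or the trailing blank.
    tail = alpha[T - 1, S - 2] if S > 1 else -np.inf
    return -logsumexp(alpha[T - 1, S - 1], tail)
```

For example, with T = 2 frames, 3 classes, uniform per-frame probabilities, and the single-label target [1], the alignments "1 1", "- 1", and "1 -" all collapse to [1], giving probability 3/9 and loss ln 3. In practice a framework implementation such as `torch.nn.CTCLoss` would be used; this sketch only shows the mechanism.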


