Abstract

Urdu is written in a complex cursive script, and its large number of ligatures makes it challenging for OCR systems to recognize. Several techniques for recognizing Urdu ligatures have been proposed in the literature. However, we found that ligature recognition still needs suitably challenging datasets and, consequently, higher recognition rates. In this paper, a hybrid model based on the holistic approach is adopted for the recognition of Urdu ligatures (compound characters). More than 3800 unique ligatures were used to generate 46K synthetic ligature images (38K for training, 7K for testing) by applying 9 different kinds of transformations alongside the untransformed ligatures. Each ligature is processed through two streams of deep neural networks, namely AlexNet and VGG16, to obtain a set of features specific to each network. These features are fused and then used as input to a two-layer Bidirectional Long Short-Term Memory (BLSTM) network to learn a model. The learned model maps ligature images to their corresponding sequences of individual Urdu characters. The output of the proposed methodology is in editable Urdu-script format. The proposed model was evaluated and achieved an accuracy of 97% on the training dataset and 80% on more than 7K parametrically different query ligatures (the test set).
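
As a rough illustration of what mapping a ligature to its sequence of individual characters means for the label format, the minimal Python sketch below splits a ligature string into Unicode code points and scores predictions by exact match. The function names `to_char_sequence` and `exact_match_accuracy` are hypothetical, and scoring a ligature as correct only when its whole character sequence matches is an assumption; the abstract does not say how accuracy is computed.

```python
from typing import List, Sequence


def to_char_sequence(ligature: str) -> List[str]:
    """Split a ligature into its individual characters (Unicode code points)."""
    return list(ligature)


def exact_match_accuracy(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of ligatures whose predicted character sequence matches exactly.

    Assumption: a ligature counts as correct only if every character matches.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    hits = sum(1 for p, r in zip(predicted, reference) if p == r)
    return hits / len(reference)


# Example ligature written with escapes: beh (U+0628), heh goal (U+06C1), teh (U+062A).
BAHUT = "\u0628\u06C1\u062A"
# Second example: kaf (U+06A9), lam (U+0644).
KAL = "\u06A9\u0644"

if __name__ == "__main__":
    print(to_char_sequence(BAHUT))
    # One of the two predictions drops a character, so only one ligature matches.
    print(exact_match_accuracy([BAHUT, "\u06A9"], [BAHUT, KAL]))
```

Because the targets are ordinary Unicode strings, a predicted sequence can be joined back into editable Urdu text with `"".join(...)`.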

Highlights

  • Urdu is the national language of Pakistan and 6 Indian states [1], covering more than 260 million people

  • Two kinds of synthetic images were generated from the ligature text of the CLE dataset, one for training and another for testing

  • We performed t-SNE visualization to understand the complexity of the dataset


Summary

Introduction

Urdu is the national language of Pakistan and 6 Indian states [1], covering more than 260 million people. Script recognition is an essential part of any simple or photo OCR system. OCR systems are generally divided into two categories: offline systems [1]–[3] and online systems [4], [5]. Offline means that recognition happens at a later stage, essentially recognizing text from printed documents or photographs, while online means that the text is recognized as it is written, usually on tablets or smartphones. An OCR system for the Urdu language must cope with different writing styles of Urdu script, ligatures of multiple sizes, and image degradations. Along with these variations, the presence of diacritics in Urdu script results in low recognition rates [6], [7]. Urdu has two commonly used writing styles, Naskh and Nastalique [8], besides others.

