Abstract

Urdu literature exists largely in the form of manuscripts and typewritten books. There is a need to convert these physical libraries into electronic libraries. OCR systems have been developed for various languages and are widely used. Building a complete Urdu OCR is difficult because Urdu is a highly cursive language in which overlapping ligatures and style variation pose challenges to the recognition system. We describe a technique for automatic recognition of off-line printed Urdu text using Hidden Markov Models (HMMs). Our method does not require segmentation into characters and treats each shape of an Urdu character as a separate class, resulting in a total of 196 classes (compared to 38 Urdu letters). This paper presents a novel feature extraction method based on a sliding window technique that uses only 16 statistical features from each sliding window, thereby eliminating the need for segmentation of Urdu text. We present the dependence of the recognition rate of Urdu script on the number of HMM states, the size of the hierarchical window, and the font. We use HTK (Hidden Markov Model Toolkit) for training, recognition and result analysis.
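The sliding-window idea in the abstract can be illustrated with a small sketch. The text here does not say which 16 statistical features are used, so the choice below (ink density in 16 horizontal bands of each window) is purely an assumption made for illustration, as are the function name `sliding_window_features` and the parameters `window_width`, `step` and `cells`. It is not the authors' feature set.

```python
from typing import List

Image = List[List[int]]  # binary text-line image: 1 = ink, 0 = background


def sliding_window_features(image: Image, window_width: int = 4,
                            step: int = 2, cells: int = 16) -> List[List[float]]:
    """Return one feature vector per window position along a text line.

    Hypothetical feature choice (not taken from the paper): each window is
    cut into `cells` horizontal bands, and the ink density of each band
    becomes one feature, giving `cells` features per window.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    if height < cells:
        raise ValueError("image height must be at least the number of cells")

    # Row boundaries that split the height into `cells` non-empty bands.
    bounds = [i * height // cells for i in range(cells + 1)]

    vectors: List[List[float]] = []
    # Urdu is written right to left, so the window starts at the right edge.
    for right in range(width, window_width - 1, -step):
        left = right - window_width
        vector = []
        for top, bottom in zip(bounds, bounds[1:]):
            ink = sum(image[r][c]
                      for r in range(top, bottom)
                      for c in range(left, right))
            area = (bottom - top) * window_width
            vector.append(ink / area)
        vectors.append(vector)
    return vectors
```

Calling `sliding_window_features(line_image)` on a binarized text line yields a sequence of 16-dimensional vectors, one per window position. A sequence of this kind could serve as the observation sequence for an HMM, which is why no prior segmentation into characters is needed.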

Highlights

  • Optical character recognition, abbreviated as OCR, is the technique that converts scanned images of handwritten, typewritten or printed text into a machine-encoded form that a computer can process, edit, search, save, and copy an unlimited number of times without any degradation or loss of information

  • 4) Results on different fonts: Five different Urdu fonts were used for recognition and testing. Table 4 summarizes the results of Akhbar, Andalus, Naskh and Arial fonts

  • Future Work: Many extensions are possible, either to enhance the performance of the system or to make the approach applicable to a wider range of tasks related to Urdu text recognition


Summary

Introduction

Optical character recognition, abbreviated as OCR, is the technique that converts scanned images of handwritten, typewritten or printed text into a machine-encoded form that a computer can process, edit, search, save, and copy an unlimited number of times without any degradation or loss of information. Segmenting the script into characters is a very difficult and complex procedure. It invariably introduces errors, which result in low recognition rates. The method does not require segmentation into characters and is applied to cursive Urdu script, where ligatures, overlaps and style variation pose challenges to the recognition system. Character recognition for Urdu script faces challenges mainly because of the script's cursive nature, its multiple fonts, the context-dependent shapes of its characters, and the position of characters with respect to the baseline. These obstacles have played an important role in delaying character recognition.
