Abstract

This paper presents a technique for optical recognition of Urdu characters using template matching based on a probabilistic N-Gram language model. Dataset used has the collection of both printed and typed text. This model is able to perform three types of segmentations including line, ligature and character using horizontal projection, connected component labeling, corners and pointers techniques, respectively. A separate stochastic lexicon is built from a collected corpus, which contains the probability values of grams. By using template matching and the N-Gram language model, our study predicts complete segmented words with the promising result, particularly in case of bigrams. It outperforms three out of four existing models with an accuracy rate of 97.33%. Results achieved on our test dataset are encouraging in one perspective but provide direction to work for further improvement in this model.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.