Abstract

This paper presents an efficient Optical Character Recognition (OCR) system for offline recognition of isolated handwritten Pashto characters. Developing an OCR system for handwritten characters is challenging because the characters vary in both shape and style, and in most cases they also vary from one writer to another. Recognizing handwritten Pashto letters is even more challenging because no standard database of handwritten Pashto characters is available. For experimental and simulation purposes, a handwritten Pashto character database was developed by collecting handwritten samples from university students on A4-sized pages. The collected samples were then scanned, stemmed and preprocessed to form a medium-sized database of 14,784 handwritten Pashto character images (336 distinct handwritten samples for each of the 44 characters in the Pashto script). Zernike moments are used as the feature extractor to compute features for each individual character. Linear Discriminant Analysis (LDA) is then used as the classifier, operating on the feature map computed from the Zernike moments. The proposed system is validated with 10-fold cross-validation, and an overall accuracy of 63.71% is obtained for isolated handwritten Pashto characters.
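To make the feature-extraction step concrete, the following is a minimal sketch of how Zernike moment magnitudes could be computed for one character image. It is an illustration under stated assumptions, not the authors' implementation: the function names (`radial_poly`, `zernike_moment`, `zernike_features`), the maximum order of 4, and the mapping of the square image onto the unit disk are all hypothetical choices that the abstract does not specify.

```python
import cmath
import math


def radial_poly(n, m, rho):
    """Zernike radial polynomial R_nm(rho); needs |m| <= n and n - |m| even."""
    m = abs(m)
    total = 0.0
    for s in range((n - m) // 2 + 1):
        coeff = ((-1) ** s * math.factorial(n - s)
                 / (math.factorial(s)
                    * math.factorial((n + m) // 2 - s)
                    * math.factorial((n - m) // 2 - s)))
        total += coeff * rho ** (n - 2 * s)
    return total


def zernike_moment(image, n, m):
    """Discrete approximation of the Zernike moment Z_nm.

    `image` is a square list of rows (assumed layout). Pixel centres are
    mapped onto [-1, 1] x [-1, 1]; pixels outside the unit disk are ignored.
    """
    size = len(image)
    pixel_area = (2.0 / size) ** 2          # area of one pixel after mapping
    acc = 0j
    for r, row in enumerate(image):
        y = (2 * r + 1 - size) / size
        for c, value in enumerate(row):
            if not value:
                continue                    # background pixel contributes nothing
            x = (2 * c + 1 - size) / size
            rho = math.hypot(x, y)
            if rho > 1.0:
                continue                    # outside the unit disk
            theta = math.atan2(y, x)
            # Multiply by the complex conjugate of the basis function V_nm.
            acc += value * radial_poly(n, m, rho) * cmath.exp(-1j * m * theta)
    return (n + 1) / math.pi * acc * pixel_area


def zernike_features(image, max_order=4):
    """Magnitudes |Z_nm| for all 0 <= m <= n <= max_order with n - m even."""
    feats = []
    for n in range(max_order + 1):
        for m in range(n % 2, n + 1, 2):
            feats.append(abs(zernike_moment(image, n, m)))
    return feats
```

With `max_order=4` this yields a 9-element feature vector per image. Only the magnitudes are kept because they do not change when the character is rotated, which is one common reason for choosing Zernike moments. The classification step is not shown here; one plausible way to reproduce it would be to pass such feature vectors to an off-the-shelf LDA classifier and score it with 10-fold cross-validation, for example scikit-learn's `LinearDiscriminantAnalysis` with `cross_val_score(..., cv=10)`.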

Highlights

  • In recent decades, a great deal of research has been reported on machine learning and pattern recognition problems

  • This paper presents an Optical Character Recognition (OCR) system for offline isolated Pashto characters, using Zernike moments for feature extraction and Linear Discriminant Analysis (LDA) for classification

  • For the proposed Handwritten Pashto Characters Recognition (HPCR) system, a handwritten Pashto character database is developed as input, Zernike moments are used for feature extraction, and Linear Discriminant Analysis is used for recognition


Introduction

A great deal of research has been reported on machine learning and pattern recognition problems. Optical Character Recognition (OCR) is a significant research problem in pattern recognition. State-of-the-art techniques have been proposed for many languages around the world, including English, Chinese, Arabic, Hindi, Dari and Persian, and high accuracy has been reported for them. Cursive-script languages such as Arabic, Pashto and Urdu remain open research fields because of the complexity of their writing and word formation. Writing styles in these languages vary from person to person, and they even vary slightly for the same person on different occasions. These are the main obstacles to attaining state-of-the-art performance in cursive-script languages.
