A complete printed Bangla OCR system

B.B Chaudhuri,U Pal

doi:10.1016/s0031-3203(97)00078-2

Abstract

A complete Optical Character Recognition (OCR) system for printed Bangla, the fourth most popular script in the world, is presented. This is the first OCR system among all script forms used in the Indian sub-continent. The problem is difficult because (i) there are about 300 basic, modified and compound character shapes in the script, (ii) the characters in a word are topologically connected and (iii) Bangla is an inflectional language. In our system the document image captured by Flat-bed scanner is subject to skew correction, text graphics separation, line segmentation, zone detection, word and character segmentation using some conventional and some newly developed techniques. From zonal information and shape characteristics, the basic, modified and compound characters are separated for the convenience of classification. The basic and modified characters which are about 75 in number and which occupy about 96% of the text corpus, are recognized by a structural-feature-based tree classifier. The compound characters are recognized by a tree classifier followed by template-matching approach. The feature detection is simple and robust where preprocessing like thinning and pruning are avoided. The character unigram statistics is used to make the tree classifier efficient. Several heuristics are also used to speed up the template matching approach. A dictionary-based error-correction scheme has been used where separate dictionaries are compiled for root word and suffixes that contain morpho-syntactic informations as well. For single font clear documents 95.50% word level (which is equivalent to 99.10% character level) recognition accuracy has been obtained. Extension of the work to Devnagari, the third most popular script in the world, is also discussed.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

R Discovery Prime

R Discovery Prime

A complete printed Bangla OCR system

Abstract

Talk to us

Similar Papers

More From: Pattern Recognition

Lead the way for us

Journal: Pattern Recognition	Publication Date: Mar 1, 1998
Citations: 343

Similar Papers

Soft Computing Techniques for Optical Character Recognition Systems
Arindam Chaudhuri ... Pratixa Badelia
-
Arindam Chaudhuri, et. al.Arindam Chaudhuri ... Pratixa Badelia
24 Dec 2016
24 Dec 2016

JPEG for Arabic Handwritten Character Recognition: Add a Dimension of Application
Abdurazzag Ali ... Salem Ali
-
Abdurazzag Ali, et. al.Abdurazzag Ali ... Salem Ali
01 Oct 2008
01 Oct 2008

Offline optical character recognition (OCR) method: An effective method for scanned documents
Mujibur Rahman Majumder ... Mahbubul Alam
-
Mujibur Rahman Majumder, et. al.Mujibur Rahman Majumder ... Mahbubul Alam
01 Dec 2019
01 Dec 2019

Line, Word, and Character Segmentation from Bangla Handwritten Text—A Precursor Toward Bangla HOCR
Payel Rakshit ... Kaushik Roy
-
Payel Rakshit, et. al.Payel Rakshit ... Kaushik Roy
01 Jan 2018
01 Jan 2018

Editage

Paperpal

R Discovery

Mind the Graph

R Discovery Prime

R Discovery Prime

A complete printed Bangla OCR system

Abstract

Talk to us

Similar Papers

More From: Pattern Recognition