Text recognition in bilingual machine printed image documents — Challenges and survey: A review on principal and crucial concerns of text extraction in bilingual printed images

Shalini Puri, Satya Prakash Singh

doi:10.1109/isco.2016.7727069

Abstract

In this digital world, accurate text identification and recognition has become a key area of document image analysis and processing. Textual data is identified and extracted from images ranging from simple to complex, with language variations spanning monolingual, bilingual, trilingual and multilingual scripts. This paper focuses on the challenges and complex issues of text recognition in bilingual machine printed image documents. The major factors that become bottlenecks to correct and accurate recognition are identified and discussed. A hierarchical structure is then presented that depicts three Classification Schemes (CS) A, B and C of bilingual printed image documents, where A, B and C relate to the content form, image mining, and language or script determination, respectively. Some loopholes in the working of OCR are also discussed. To analyze the existing algorithms and methods, a survey is presented that focuses on their critical issues and proposed solutions, along with the constraints and errors found during text processing. This reveals the shortcomings and limitations of the different methods. Various specifications and factors drawn from the techniques are also presented as their characteristics and compared to distinguish them. It is observed that most of the existing methods are based on the classification schemes CS A-A1, C-C1 and C2, and are designed for script identification on 300 dpi grayscale images using an SVM classifier.

Full Text