A Robust OCR for Degraded Documents

Kapil Dev Dhingra, Sudip Sanyal, Pramod Kumar Sharma

doi:10.1007/978-0-387-74938-9_34

Abstract

In the last two decades, many advances have been made in the field of document image analysis and recognition. In the recent past, several methods for recognizing Latin, Chinese, Japanese, and Arabic scripts have been proposed [7–9]. Until now, most optical character recognition (OCR) work has concentrated on high-quality images, and character recognition systems have achieved great success on them. Despite these successes, two challenging problems remain in the field of recognition. The first is OCR for low-quality images: images with luminance variations, noise, and random degradation of text are difficult for OCR systems to read. The second open problem is the off-line recognition of cursive handwritten characters [15]. Our work concentrates on the former, particularly for Devanagari script, which is the script for Hindi, Nepali, Marathi, and several other Indic languages. Together, these languages have a user base exceeding 500 million people. A great deal of effort has been made towards the development of OCR for Indian scripts [1–3, 10]. This chapter is concerned with the recognition of degraded Devanagari text documents. A major contribution in the area of Devanagari OCR is the Hindi OCR system developed by Chaudhari [2]. As remarked earlier, the majority of the above-mentioned work has concentrated primarily on good-quality document images, and little work has been reported so far on OCR for degraded Devanagari document images. Jawahar [4] presented a scheme based on an SVM classifier, but this work was not focused on degraded document images. A major contribution towards the development of OCR for degraded documents is OCR for Chinese characters [13]. Apart from the advances that have been made towards
