Multilingual Character Segmentation and Recognition Schemes for Indian Document Images

Parul Sahare, Sanjay B Dhok

doi:10.1109/access.2018.2795104

Open Access

Journal: IEEE Access
Publication Date: Jan 1, 2018
Citations: 117
License type: cc-by-nc-nd

Affiliation: Visvesvaraya National Institute of Technology

Abstract

In this paper, robust algorithms for character segmentation and recognition are presented for multilingual Indian document images of Latin and Devanagari scripts. These documents generally pose difficulties because of their layout organization, local skews, and low print quality, and they contain intermixed machine-printed and handwritten text. In the proposed character segmentation algorithm, primary segmentation paths are obtained using the structural properties of characters, whereas overlapped and joined characters are separated using graph distance theory. Finally, segmentation results are validated using a highly accurate support vector machine classifier. For the proposed character recognition algorithm, three new geometrical shape-based features are computed. The first and second features are formed with respect to the center pixel of the character, whereas neighborhood information of text pixels is used to calculate the third feature. To recognize the input character, a $k$-Nearest Neighbor classifier is used, as it intrinsically has zero training time. Comprehensive experiments are carried out on different databases containing printed as well as handwritten texts. Benchmarking results show that the proposed algorithms outperform other contemporary approaches, with highest segmentation and recognition rates of 98.86% and 99.84%, respectively.
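
The remark that a $k$-Nearest Neighbor classifier has intrinsically zero training time can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the class name `KNNClassifier`, the Euclidean distance, the choice `k=3`, and the toy three-dimensional vectors standing in for the three shape-based features. The abstract does not specify the distance metric, the value of $k$, or the feature dimensionality.

```python
import math
from collections import Counter


class KNNClassifier:
    """Hypothetical minimal k-NN classifier (illustrative, not the authors' code)."""

    def __init__(self, k=3):
        self.k = k  # assumed value; the abstract does not state k
        self.samples = []
        self.labels = []

    def fit(self, samples, labels):
        # "Training" only stores the labelled feature vectors.
        # No parameters are estimated, hence the zero training time.
        self.samples = [tuple(s) for s in samples]
        self.labels = list(labels)
        return self

    def predict(self, query):
        # All the work happens at query time: compute the distance to
        # every stored sample, keep the k nearest, and take a majority vote.
        # Euclidean distance is an assumption made for this sketch.
        distances = sorted(
            (math.dist(sample, query), label)
            for sample, label in zip(self.samples, self.labels)
        )
        votes = Counter(label for _, label in distances[: self.k])
        return votes.most_common(1)[0][0]


# Toy feature vectors standing in for the three shape-based features.
train_features = [
    (0.1, 0.2, 0.1),
    (0.2, 0.1, 0.2),
    (0.9, 0.8, 0.9),
    (0.8, 0.9, 0.7),
]
train_labels = ["latin_a", "latin_a", "devanagari_ka", "devanagari_ka"]

classifier = KNNClassifier(k=3).fit(train_features, train_labels)
prediction_latin = classifier.predict((0.15, 0.15, 0.15))
prediction_devanagari = classifier.predict((0.85, 0.85, 0.80))
```

The trade-off of this design is that the cost moves from training to prediction: each query is compared against every stored sample, so recognition time grows with the size of the reference set.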

Full Text