Offline Handwritten Telugu Character Dataset and Recognition

Atul Negi, Anish M Rao

doi:10.1109/indicon47234.2019.9028977

Abstract

Telugu is a Dravidian language spoken mainly in the southern parts of India. It has close to 81 million native speakers, making it the fifteenth most widely spoken language in the world. Here we present a comprehensive database of handwritten Telugu characters to drive progress in handwriting recognition for this script. We claim that this is significant because it brings together the largest set of vowels, consonants, vowel-consonant pairs and consonant-consonant pairs of the Telugu orthography. The database consists of real-world offline handwritten characters extracted from scanned documents, making it the largest and most varied database in this domain. The data collection method, the preprocessing steps, and the extraction approach used to obtain individual Telugu characters are explained in detail. The dataset is also made openly available for use as a test set to evaluate handwriting recognition approaches and other related tasks. This work also presents a method for handwritten Telugu character recognition that uses Convolutional Neural Networks as a baseline classifier and Visual Attention Networks as a more advanced and effective solution. Finally, the proposed architecture is compared with previous solutions and the results are discussed.

Full Text