Abstract

Handwritten character recognition has been a challenging research problem in document image analysis for many decades, owing to wide variation in writing styles, inherent noise in the data, the breadth of applications it supports, and the scarcity of benchmark databases. Considerable work has been reported on database creation for several Indic scripts, but work on the Tamil script is still in its infancy: only one database has been reported [5]. In this paper, we present the creation of a large, exhaustive, unconstrained Tamil Handwritten Character Database (uTHCD). The database consists of around 91,000 samples, with nearly 600 samples in each of 156 classes. It is a unified collection of both online and offline samples. Offline samples were collected by asking volunteers to write within a specified grid on a form. For online samples, volunteers wrote in a similar grid using a digital writing pad. The collected samples encompass a wide variety of writing styles as well as distortions inherent to the offline scanning process, such as stroke discontinuity, variable stroke thickness, and distortion. Algorithms that are resilient to such data can be practically deployed in real-time applications. The samples were collected from around 650 native Tamil volunteers, including schoolchildren, homemakers, university students, and faculty. The isolated character database will be made publicly available as raw images and as a compressed Hierarchical Data Format (HDF) file. With this database, we expect to set a new benchmark in Tamil handwritten character recognition and to provide a launchpad for many avenues in the document image analysis domain. The paper also presents an ideal experimental set-up using the database with convolutional neural networks (CNN), with a baseline accuracy of 88% on test data.
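As a quick consistency check on the reported database size, the sketch below divides the approximate sample count by the number of classes. The figures are the rounded values quoted in the abstract, and the variable and function names are illustrative assumptions, not part of the database itself.

```python
# Rounded figures quoted in the abstract (approximate, not exact counts).
APPROX_TOTAL_SAMPLES = 91_000
NUM_CLASSES = 156


def average_samples_per_class(total_samples: int, num_classes: int) -> float:
    """Return the mean number of samples per class."""
    return total_samples / num_classes


avg = average_samples_per_class(APPROX_TOTAL_SAMPLES, NUM_CLASSES)
print(f"about {avg:.0f} samples per class on average")
```

The average comes out a little under 600, which agrees with the abstract's "nearly 600 samples" per class.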

Highlights

  • Tamils, or Tamilians, are one of the world’s oldest surviving ethnolinguistic groups, with a population currently estimated at around 76 million and a language history dating back over 2000 years [1], [2]

  • The performance of Adam, Nadam, and RMSProp is comparable, and all three perform significantly better than AdaGrad and AdaDelta

  • OCR models built using the uTHCD database will capture these inherent characteristics of a scanned document; robust performance is therefore expected for automatic form processing


Summary

Introduction

Tamils, or Tamilians, are one of the world’s oldest surviving ethnolinguistic groups, with a population currently estimated at around 76 million and a language history dating back over 2000 years [1], [2]. Tamil handwritten character recognition has been a challenging research topic for close to four decades [7], [8]. It continues to pose many challenges that keep the research community active to this day [9], [10].

Tamil Script

The Tamil script contains 12 vowels, 18 consonants, and one special character known as Ayudha Ezhuthu. An additional five consonants, known as Grantha Letters, are borrowed from Sanskrit and English to represent words/syllables of north. The script therefore contains 36 unique basic letters [12 vowels + 18 consonants + 1 Ayudha Ezhuthu + 5 Granthas].
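The basic letter count can be checked with a short sketch. The group sizes are taken from the text above; the dictionary keys and the function name are illustrative assumptions.

```python
# Basic letter groups of the Tamil script, with counts as stated in the text.
BASIC_LETTER_GROUPS = {
    "vowels": 12,
    "consonants": 18,
    "ayudha_ezhuthu": 1,
    "grantha": 5,
}


def total_basic_letters(groups: dict[str, int]) -> int:
    """Sum the per-group counts to get the number of unique basic letters."""
    return sum(groups.values())


total = total_basic_letters(BASIC_LETTER_GROUPS)
print(total)  # 36
```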
