uTHCD: A New Benchmarking for Tamil Handwritten OCR

Noushath Shaffi, Faizal Hajamohideen

doi:10.1109/access.2021.3096823

Abstract

Handwritten character recognition has been a challenging research problem in document image analysis for many decades, owing to large variation in writing styles, inherent noise in the data, the breadth of applications it offers, and the non-availability of benchmark databases. Considerable work has been reported in the literature on the creation of databases for several Indic scripts, but work on the Tamil script is still in its infancy, with only one database reported [5]. In this paper, we present the creation of an exhaustive and large unconstrained Tamil Handwritten Character Database (uTHCD). The database consists of around 91000 samples, with nearly 600 samples in each of 156 classes. It is a unified collection of both online and offline samples. Offline samples were collected by asking volunteers to write on a form inside a specified grid. For online samples, the volunteers wrote in a similar grid using a digital writing pad. The samples encompass a wide variety of writing styles as well as the inherent distortions arising from the offline scanning process, viz. stroke discontinuity, variable stroke thickness, and distortion. Algorithms that are resilient to such data can be practically deployed in real-time applications. The samples were collected from around 650 native Tamil volunteers, including school-going children, homemakers, university students, and faculty. The isolated character database will be made publicly available as raw images and as a Hierarchical Data Format (HDF) compressed file. With this database, we expect to set a new benchmark in Tamil handwritten character recognition and to provide a launchpad for many avenues in the document image analysis domain. The paper also presents an ideal experimental set-up using the database with convolutional neural networks (CNN), with a baseline accuracy of 88% on test data.
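The sample counts quoted in the abstract are approximate, and a quick check shows how they relate to one another. The sketch below uses only the figures stated above (around 91000 samples, 156 classes, nearly 600 per class). The variable names are illustrative and are not part of the database or its tooling.

```python
# Approximate figures quoted in the abstract.
TOTAL_SAMPLES = 91_000   # "around 91000 samples"
NUM_CLASSES = 156        # "156 classes"
NOMINAL_PER_CLASS = 600  # "nearly 600 samples in each"

# Mean number of samples per class implied by the quoted total.
mean_per_class = TOTAL_SAMPLES / NUM_CLASSES

# Total that would result if every class held exactly 600 samples.
total_if_full = NOMINAL_PER_CLASS * NUM_CLASSES

print(f"mean samples per class: {mean_per_class:.1f}")
print(f"total at 600 per class: {total_if_full}")
```

The implied mean falls somewhat below 600, which is consistent with the abstract's wording of "nearly 600" samples per class.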

Highlights

Tamils, or Tamilians, are one of the world’s oldest surviving ethnolinguistic groups, with a population currently estimated at around 76 million and a language whose history dates back over 2000 years [1], [2].
The performance of Adam, Nadam, and RMSProp is comparable, and all three perform significantly better than AdaGrad and AdaDelta.
OCR models created using the uTHCD database will capture these inherent characteristics of a scanned document; robust performance is therefore expected for automatic form processing.

Summary

Introduction

Tamils, or Tamilians, are one of the world’s oldest surviving ethnolinguistic groups, with a population currently estimated at around 76 million and a language whose history dates back over 2000 years [1], [2]. Tamil handwritten character recognition has been a challenging research topic for close to four decades [7], [8]. It continues to offer many challenges, which keep the research community active to this day [9], [10].

Tamil Script

The Tamil script contains 12 vowels, 18 consonants, and one special character known as Ayudha Ezhuthu. An additional five consonants, known as Grantha letters, are borrowed from Sanskrit and English to represent words/syllables of north. The script thus contains 36 unique basic letters [12 vowels + 18 consonants + 1 Ayudha Ezhuthu + 5 Granthas].
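The letter inventory above can be tallied in a few lines. This is a minimal sketch using only the counts stated in the text. The dictionary keys are illustrative labels, not identifiers from the uTHCD database.

```python
# Counts of basic Tamil letters by category, as stated in the text.
basic_letter_counts = {
    "vowels": 12,
    "consonants": 18,
    "ayudha_ezhuthu": 1,
    "grantha": 5,
}

# Sum the categories to get the number of unique basic letters.
total_basic_letters = sum(basic_letter_counts.values())

print(f"unique basic letters: {total_basic_letters}")
```

The sum reproduces the 36 unique basic letters given in the text. The 156 classes of the database are a separate figure and are not derived here.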

Journal: IEEE Access	Publication Date: Jan 1, 2021
Citations: 16	License type: CC BY 4.0

Similar Papers

Isolated Handwritten Tamil Character Recognition using Convolutional Neural Networks
Nagul Ulaganathan ... Rohith J
03 Dec 2020

Recognizing Ancient Characters from Tamil Palm Leaf Manuscripts using Convolution Based Deep Learning
Kavitha Subramani ... S Murugavalli
International Journal of Recent Technology and Engineering (IJRTE) | VOL. 8
30 Sep 2019

An adaptive technique for handwritten Tamil character recognition
K Sarveswaran ... D.A.A.C Ratnaweera
01 Nov 2007

Recognizing Handwritten Offline Tamil Character by using cGAN & CNN
N Sasipriyaa ... R S Arwin Prakadis
07 Apr 2022
