Amalgamated Approach for Devanagari Script Corpus for OCR & Demographic Purpose and XML for Linguistic Annotation

Maninder Singh Nehra, Deepa Modi, Prakash Choudhary, Neeta Nain, Mushtaq Ahmed

doi:10.1109/sitis.2017.50

Abstract

In this paper, we present the compilation of a Hindi handwritten text image corpus and its linguistic perspective in the fields of OCR and information retrieval from handwritten documents. Entering a single character in Devanagari script is somewhat complicated: because of the use of modifiers, a character may require a combination of several elements. A mixed approach is proposed and demonstrated for building a Hindi corpus for OCR and demographic data collection. The demographic part of the database could be used to train a system to extract data automatically, which would simplify the manual data-processing tasks currently involved in data collection from input forms such as AADHAR, driving license, and railway reservation forms. This would increase the participation of the Hindi language community in understanding and benefiting from government schemes. To make the database available and applicable across a wide area of corpus linguistics, we propose a methodology for data collection, mark-up, digital transcription, and XML metadata information for benchmarking, and we apply Zipf's law to analyze the distribution and behavior of words in the corpus.
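
As a rough illustration of the Zipf's law analysis mentioned in the abstract, the sketch below ranks words by frequency and reports the product of rank and count, which Zipf's law predicts to be roughly constant in a large corpus. The function name `rank_frequency` and the toy token list are assumptions made for illustration only; they are not taken from the paper, and the tokens are not corpus data.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count, rank * count) rows, most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending count; break ties alphabetically so the order is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (rank, word, count, rank * count)
        for rank, (word, count) in enumerate(ordered, start=1)
    ]


if __name__ == "__main__":
    # Hypothetical, made-up token list; a real analysis would use the transcribed corpus.
    sample = "का के का में के का है का में के की का".split()
    for row in rank_frequency(sample):
        print(row)
```

On a corpus of realistic size, plotting the logarithm of count against the logarithm of rank is a common way to check how closely the word distribution follows the expected straight line.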

Full Text