Abstract

In this paper, we present the compilation of a Hindi handwritten text image corpus and discuss its linguistic perspective in the fields of OCR and information retrieval from handwritten documents. Text entry in the Devanagari script is somewhat complicated: because of the use of modifiers, entering a single character can require a combination of multiple components. A mixed approach is proposed and demonstrated for building a Hindi corpus for OCR and demographic data collection. The demographic part of the database could be used to train a system to extract data automatically, which would help simplify the manual data-processing tasks currently involved in data collection from input forms such as those for AADHAR, driving licenses, and railway reservations. This could increase the participation of the Hindi language community in understanding and benefiting from government schemes. To make the database available and applicable across a wide area of corpus linguistics, we propose a methodology for data collection, mark-up, digital transcription, and XML metadata for benchmarking, and we apply Zipf's law to analyze the distribution and behavior of words in the corpus.
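As an illustration of the kind of Zipf's law analysis mentioned above, the following minimal sketch ranks words by frequency and estimates the slope of log(frequency) against log(rank). Zipf's law predicts that frequency falls off roughly as a power of rank, so a slope near -1 is the classic case. The function names `rank_frequency` and `zipf_slope` and the toy token list are assumptions made for this example only; they are not taken from the corpus or from the authors' tooling.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, count) triples, most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending count; break ties alphabetically for determinism.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]


def zipf_slope(table):
    """Least-squares slope of log(count) versus log(rank)."""
    if len(table) < 2:
        raise ValueError("need at least two distinct words to fit a slope")
    xs = [math.log(rank) for rank, _, _ in table]
    ys = [math.log(count) for _, _, count in table]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical toy data (not from the corpus): counts 12, 6, 4, 3
# follow count = 12 / rank, i.e. an ideal Zipf distribution.
tokens = ["है"] * 12 + ["का"] * 6 + ["और"] * 4 + ["में"] * 3

table = rank_frequency(tokens)
slope = zipf_slope(table)
print(table)
print(round(slope, 6))
```

On real transcriptions the token list would come from the digital transcription of the handwritten pages, and the fitted slope would only approximate -1; how far it deviates is what such an analysis examines.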
