Abstract

Document Image Analysis (DIA) is a research area of Artificial Intelligence (AI) concerned with converting document images into machine-readable form. In DIA systems, Optical Character Recognition (OCR) plays a key role in digitizing document images. The output of an OCR system is used in many downstream applications, including Natural Language Processing (NLP), sentiment analysis, speech recognition, and translation services. Standard datasets, however, are essential for developing, evaluating, and comparing text recognition techniques. Pashto is a low-resource language for which no standard dataset of handwritten text is available. This paper therefore addresses that gap by developing a dataset named Pashto Handwritten Text Imagebase (PHTI). PHTI was created by collecting handwritten samples from diverse genres of Pashto writing, including poetry, religion, short stories, articles, novels, sports, culture, and news. The dataset consists of 4,000 scanned images written by 400 writers (200 male and 200 female). These 4,000 images are further segmented into 36,082 text-line images, and each text-line image is annotated with a transcription encoded in UTF-8. The dataset can be used in many deep learning-based applications, including text recognition, skew detection, gender classification, and age-group classification.
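As an illustration of how text-line images paired with UTF-8 transcriptions might be consumed, the following is a minimal sketch. The abstract does not specify the on-disk layout of PHTI, so the layout here is an assumption: each image `<name>.jpg` is taken to have a sibling file `<name>.txt` containing its transcription. The function name `load_transcriptions` and both file extensions are hypothetical.

```python
from pathlib import Path


def load_transcriptions(root, image_ext=".jpg", text_ext=".txt"):
    """Pair each text-line image with its UTF-8 transcription.

    Assumed (hypothetical) layout: every image `<name>.jpg` has a sibling
    `<name>.txt` holding the ground-truth text for that line.
    """
    pairs = []
    for image_path in sorted(Path(root).rglob(f"*{image_ext}")):
        text_path = image_path.with_suffix(text_ext)
        if not text_path.exists():
            continue  # skip images that have no transcription file
        # Decode explicitly as UTF-8 so Pashto characters are preserved
        # regardless of the platform's default encoding.
        text = text_path.read_text(encoding="utf-8").strip()
        pairs.append((image_path, text))
    return pairs
```

Calling `load_transcriptions("path/to/dataset")` would return a list of `(image_path, transcription)` tuples that could then be fed to a text recognition training pipeline.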
