Abstract

A corpus is the prerequisite for any Natural Language Processing (NLP) task. A corpus is defined as a large collection of structured text. Dogri is one of the official languages of India, but it is under-resourced in terms of the computational resources needed for NLP tasks. This paper proposes a methodology for constructing a standard corpus that can be used for language processing tasks such as stemming, part-of-speech tagging, and information retrieval. Digitized text for creating the corpus is not readily available because online resources containing Dogri text are scarce. The only available online source is the Dogri newspaper “Jammu Prabhat”. Hence, the text is extracted from Portable Document Format (PDF) files of that newspaper, which are first converted to images. An open-source tool, Tesseract, is then used to extract the text from the images. The methodology used for creating the Dogri corpus is discussed in detail in the paper. The challenges faced during the research and the results obtained are also discussed.
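
The extraction steps summarized above (PDF pages to images, then images to text with Tesseract) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that Poppler's `pdftoppm` and the `tesseract` command-line tools are installed, that Tesseract's Hindi model (`hin`) is an acceptable stand-in for Devanagari-script Dogri, and that a 300 dpi rendering is adequate. All function names are hypothetical.

```python
import re
import subprocess
from pathlib import Path

# Assumption: the abstract does not name the Tesseract language model used;
# "hin" (Hindi, Devanagari script) is only a stand-in for Dogri here.
OCR_LANG = "hin"

# Devanagari block, skipping the danda marks (U+0964, U+0965) so that
# sentence-ending punctuation is not glued onto the last word.
DEVANAGARI_TOKEN = re.compile(r"[\u0900-\u0963\u0966-\u097F]+")


def pdf_to_image_cmd(pdf_path, out_prefix, dpi=300):
    """Build the pdftoppm command that renders each PDF page to a PNG file."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(out_prefix)]


def ocr_cmd(image_path, lang=OCR_LANG):
    """Build the Tesseract command that prints recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def devanagari_tokens(text):
    """Keep only runs of Devanagari characters; drop Latin text and noise."""
    return DEVANAGARI_TOKEN.findall(text)


def extract_corpus(pdf_path, work_dir):
    """Hypothetical end-to-end helper: PDF -> page images -> OCR -> tokens."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    prefix = work_dir / Path(pdf_path).stem

    # Step 1: render the newspaper PDF to one image per page.
    subprocess.run(pdf_to_image_cmd(pdf_path, prefix), check=True)

    # Step 2: run OCR on every page image and collect the tokens.
    tokens = []
    for image in sorted(work_dir.glob(prefix.name + "-*.png")):
        result = subprocess.run(
            ocr_cmd(image), check=True, capture_output=True, encoding="utf-8"
        )
        tokens.extend(devanagari_tokens(result.stdout))
    return tokens
```

Calling `extract_corpus("issue.pdf", "pages")` on a hypothetical newspaper file would return a flat list of Devanagari tokens. A real corpus would still need the manual checking and cleaning that OCR output usually requires.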

