Towards the first Maithili part of speech tagger: Resource creation and system development

Ankur Priyadarshi, Sujan Kumar Saha

doi:10.1016/j.csl.2019.101054

Abstract

Part-of-speech (POS) tagging for the Indian language Maithili remains largely unexplored. Substantial effort has gone into developing POS taggers for several Indian languages, including Hindi, Bengali, Tamil, Telugu, Kannada, Punjabi and Marathi, but we did not find any openly available POS tagger or tagged corpus for Maithili. Yet Maithili is one of the official languages of India, with around 50 million native speakers. Developing Maithili natural language processing (NLP) tools and resources is important because the language is currently used in education and official contexts in certain Indian states. In this paper, we present our effort to develop a Maithili POS tagger. As we did not find any open training data, we began by annotating a POS-tagged corpus: we defined a POS tagset and manually annotated a Maithili corpus containing 52,190 words. We used this corpus to train a conditional random fields (CRF) classifier. We ran experiments with various feature sets and achieved an accuracy of 82.67%. We then collected large raw corpora, comprising a Wikipedia dump and other Maithili web resources, to train neural word embeddings. A word2vec CBOW model was trained, and the resulting word vectors were used during CRF training. With this addition, the accuracy of the system increased to 85.88%, a gain of 3.21 percentage points.
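To make the pipeline described above more concrete, the following is a minimal sketch of how per-token features for a CRF tagger might be assembled, with pre-trained word vectors appended as extra real-valued features. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not list the feature sets used, so the feature names (prefix, suffix, neighbouring words), the function names token_features and sentence_features, and the choice to expose each embedding dimension as a separate feature are all hypothetical. The embeddings argument stands in for vectors produced by a word2vec CBOW model, and the example tokens are placeholders rather than corpus data.

```python
from typing import Dict, List, Optional, Sequence, Union

FeatureValue = Union[str, float, bool]
Embeddings = Dict[str, Sequence[float]]


def token_features(
    sentence: Sequence[str],
    i: int,
    embeddings: Optional[Embeddings] = None,
) -> Dict[str, FeatureValue]:
    """Build a feature dict for the token at position i.

    The feature set is hypothetical; it only illustrates the kind of
    lexical and contextual cues a CRF tagger commonly receives.
    """
    word = sentence[i]
    feats: Dict[str, FeatureValue] = {
        "bias": 1.0,
        "word": word,
        "prefix2": word[:2],
        "suffix2": word[-2:],
        "suffix3": word[-3:],
        "is_digit": word.isdigit(),
        # Sentence-boundary markers replace missing neighbours.
        "prev_word": sentence[i - 1] if i > 0 else "<BOS>",
        "next_word": sentence[i + 1] if i < len(sentence) - 1 else "<EOS>",
    }
    # Assumption: each embedding dimension becomes one real-valued feature.
    # Words missing from the embedding table simply get no such features.
    if embeddings is not None:
        vector = embeddings.get(word)
        if vector is not None:
            for d, value in enumerate(vector):
                feats[f"emb_{d}"] = float(value)
    return feats


def sentence_features(
    sentence: Sequence[str],
    embeddings: Optional[Embeddings] = None,
) -> List[Dict[str, FeatureValue]]:
    """Return one feature dict per token, in sentence order."""
    return [token_features(sentence, i, embeddings) for i in range(len(sentence))]


if __name__ == "__main__":
    # Placeholder tokens and a toy two-dimensional embedding table.
    toy_embeddings = {"beta": [0.1, -0.2]}
    for feats in sentence_features(["alpha", "beta", "42"], toy_embeddings):
        print(feats)
```

Lists of feature dicts in this shape are what several CRF toolkits accept as training input, paired with one gold tag per token. Calling sentence_features without the embeddings argument corresponds to the baseline configuration, and passing the embedding table corresponds to the embedding-augmented one.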

Full Text