Development of Part-of-Speech tagger for a low-resource endangered language

Toshal Gore, Vaibhav Khatavkar

doi:10.1109/icac3n56670.2022.10074031

Abstract

India is a multilingual country in which a large number of languages are spoken, the major ones including Hindi, Bengali and Marathi. Research on Indian languages in the Natural Language Processing (NLP) domain remains limited. One reason is that Indian languages are written in Brahmic scripts rather than the Latin alphabet, which makes them very difficult for NLP systems to process. Most Indian languages have many dialects and many linguistic characteristics that distinguish them from English. In addition, many Indian languages are on the verge of extinction, and very little NLP work has been done to help preserve them. The datasets available for such low-resource languages are very small. One such language is Katkari, an endangered Indian tribal language and a dialect of Marathi. The purpose of this work is to develop a Part-of-Speech (POS) tagger for Katkari. POS tagging is a technique in which each word in a text is assigned a POS label based on its context. POS taggers have been developed for several Indian languages, but not yet for Katkari. Hence, this paper presents a POS tagger for Katkari built using a Hidden Markov Model (HMM) and the Viterbi algorithm. The Katkari POS tagger achieved an accuracy of 86.84% and was compared with POS taggers for other Indian languages.
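As a rough illustration of the HMM and Viterbi approach named above, the following is a minimal sketch and not the authors' implementation. It assumes a bigram HMM with add-k smoothing, and the three-sentence English corpus, the tag names, and the helper names `train_hmm` and `viterbi` are all hypothetical stand-ins, since the Katkari data is not reproduced here.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical sentence-start marker


def train_hmm(tagged_sentences, k=1.0):
    """Count tag transitions and word emissions; return add-k smoothed log-probabilities."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag] = count
    emit = defaultdict(Counter)   # emit[tag][word] = count
    vocab = set()
    for sent in tagged_sentences:
        prev = START
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    tags = sorted(emit)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot for unseen words

    def log_trans(prev, tag):
        total = sum(trans[prev].values())
        return math.log((trans[prev][tag] + k) / (total + k * n_tags))

    def log_emit(tag, word):
        total = sum(emit[tag].values())
        return math.log((emit[tag][word] + k) / (total + k * n_words))

    return tags, log_trans, log_emit


def viterbi(words, tags, log_trans, log_emit):
    """Return the most probable tag sequence for words under the HMM."""
    if not words:
        return []
    # best[i][t]: log-probability of the best path ending in tag t at position i
    best = [{t: log_trans(START, t) + log_emit(t, words[0]) for t in tags}]
    back = []  # back[i][t]: best previous tag for tag t at position i + 1
    for word in words[1:]:
        scores, pointers = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: best[-1][p] + log_trans(p, t))
            scores[t] = best[-1][prev] + log_trans(prev, t) + log_emit(t, word)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final tag
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy English corpus, purely illustrative (not Katkari data)
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
tags, log_trans, log_emit = train_hmm(corpus)
print(viterbi(["the", "cat", "barks"], tags, log_trans, log_emit))
```

Working in log space avoids numerical underflow on longer sentences, and the smoothing constant `k` lets the sketch assign a nonzero probability to words and tag pairs never seen in training, which matters when the training corpus is small.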

Full Text