Abstract

Part-of-speech (POS) tagging for the Indian language Maithili remains largely unexplored. Substantial effort has gone into developing POS taggers for several Indian languages, including Hindi, Bengali, Tamil, Telugu, Kannada, Punjabi and Marathi, but we did not find any openly available POS tagger or tagged corpus for Maithili. This is despite Maithili being one of the official languages of India, with around 50 million native speakers. Developing Maithili natural language processing (NLP) tools and resources is important because the language is currently used in education and official contexts in certain states of India. In this paper, we present our effort to develop a Maithili POS tagger. As we did not find any open training data, we began by annotating a POS-tagged corpus: we defined a POS tagset and manually annotated a Maithili corpus containing 52,190 words. We used this corpus to train a conditional random fields (CRF) classifier. We ran experiments with various feature sets and achieved an accuracy of 82.67%. We then collected large raw corpora, comprising a Wikipedia dump and other Maithili web resources, to train neural word embeddings. A word2vec CBOW model was trained, and the resulting word vectors were used during CRF training. With this addition, the accuracy of the system increased to 85.88%.
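
The abstract does not list the feature sets used, so the following is only a minimal sketch of how hand-crafted token features and pre-trained word vectors might be combined into the per-token feature dictionaries that common CRF toolkits accept. The feature names, the `<BOS>`/`<EOS>` markers, the zero-vector fallback for unknown words, and the toy embedding table are all assumptions made for illustration, not the authors' configuration.

```python
from typing import Dict, List, Sequence

# Hypothetical lookup table: word -> embedding vector.
# In practice this could be filled from a trained word2vec CBOW model.
Embeddings = Dict[str, Sequence[float]]


def token_features(sentence: List[str], i: int,
                   embeddings: Embeddings, dim: int) -> Dict[str, object]:
    """Build a feature dictionary for the token at position i."""
    word = sentence[i]
    feats: Dict[str, object] = {
        "bias": 1.0,
        "word": word,
        # Affix features (assumed; often useful for morphologically rich languages).
        "prefix2": word[:2],
        "suffix2": word[-2:],
        "suffix3": word[-3:],
        "is_digit": word.isdigit(),
        # Context features with sentence-boundary markers.
        "prev_word": sentence[i - 1] if i > 0 else "<BOS>",
        "next_word": sentence[i + 1] if i < len(sentence) - 1 else "<EOS>",
    }
    # Add each embedding dimension as a real-valued feature.
    # Words missing from the table fall back to a zero vector (an assumption).
    vector = embeddings.get(word, [0.0] * dim)
    for k, value in enumerate(vector):
        feats[f"emb_{k}"] = float(value)
    return feats


def sentence_features(sentence: List[str],
                      embeddings: Embeddings, dim: int) -> List[Dict[str, object]]:
    """Build feature dictionaries for every token in a sentence."""
    return [token_features(sentence, i, embeddings, dim)
            for i in range(len(sentence))]


# Placeholder tokens and a toy 2-dimensional embedding table (not real data).
toy_embeddings = {"aaa": [0.1, -0.2]}
features = sentence_features(["aaa", "bbb", "123"], toy_embeddings, dim=2)
```

Each sentence becomes a list of dictionaries, one per token, which would be paired with the gold tag sequence when training the CRF. Comparing a run with the `emb_*` features against a run without them mirrors the kind of with-and-without-embeddings comparison reported above.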
