Abstract

This study presents an efficient framework for deriving the lemma of an inflected Bangla word, using its part of speech as context. Bangla is a morphologically rich Indo-Aryan language in which around 70% of words are inflected, and some words have around 90 different inflected forms, making it one of the most challenging languages for lemmatization. The lack of a sufficiently large, appropriate Bangla dataset makes the task even harder. A reliable, robust Bangla lemmatizer would open new possibilities for dependent fields such as automatic translation and grammatical correction in Bangla. In this paper, we describe a new, larger Bangla dataset for lemmatization and an encoder-decoder-based sequence-to-sequence framework for the task. After hyper-parameter tuning, the proposed framework yielded 95.75% character accuracy and 91.81% exact match on the test split of the prepared dataset, which is significantly higher than other existing approaches to Bangla lemmatization.

Article Highlights

This article:
- Discusses the lemmatization task in Bangla and demonstrates how it differs from stemming
- Presents an efficient artificial neural network based model for lemmatization that performs better than existing ones
- Describes a new large dataset for lemmatization in the Bangla language
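The abstract reports two metrics, character accuracy and exact match, without defining them. The sketch below shows one plausible way to compute both over pairs of predicted and gold lemmas. The function names, the toy strings, and in particular the position-wise definition of character accuracy are assumptions made for illustration, not the authors' stated evaluation procedure.

```python
from typing import Sequence


def exact_match(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predicted lemmas that are identical to the gold lemma."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def character_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Assumed definition: characters that match at the same position,
    divided by the length of the longer string, pooled over all pairs."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    matched = 0
    total = 0
    for p, g in zip(predicted, gold):
        # zip stops at the shorter string, so extra characters count as errors
        matched += sum(a == b for a, b in zip(p, g))
        total += max(len(p), len(g))
    return matched / total if total else 0.0


# Hypothetical toy data (romanized placeholders, not taken from the dataset).
predicted_lemmas = ["kor", "ja", "bol"]
gold_lemmas = ["kor", "ja", "bola"]

em = exact_match(predicted_lemmas, gold_lemmas)
ca = character_accuracy(predicted_lemmas, gold_lemmas)
print(f"exact match: {em:.4f}, character accuracy: {ca:.4f}")
```

Python strings iterate by Unicode code point, so for Bangla text a dependent vowel sign or other combining mark counts as its own character under this sketch. A grapheme-based comparison would give different numbers.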
