BaNeL: an encoder-decoder based Bangla neural lemmatizer

Md Ashraful Islam, Md Towhiduzzaman, Md Tauhidul Islam Bhuiyan, Jesan Ahammed Ovi, Abdullah Al Maruf

doi:10.1007/s42452-022-04985-2



Journal: SN Applied Sciences
Publication Date: Apr 9, 2022
Citations: 3
License type: open-access

Affiliation: University of Dhaka, East West University

Abstract

This study presents an efficient framework for deriving the lemma of an inflected Bangla word, using its part of speech as context. Bangla is a morphologically rich Indo-Aryan language in which around 70% of words are inflected and some words have around 90 different inflected forms, making it one of the most challenging languages for lemmatization. The lack of a sufficiently large, appropriate Bangla dataset makes the task even harder. A reliable, robust Bangla lemmatizer would open new possibilities for dependent fields such as automatic translation and grammatical correction in Bangla. In this paper, we describe a new, larger Bangla dataset for lemmatization and an encoder-decoder-based sequence-to-sequence framework for the task. After hyper-parameter tuning, the proposed framework achieved 95.75% character accuracy and 91.81% exact match on the test split of the prepared dataset, significantly higher than existing approaches to Bangla lemmatization.

Article Highlights

This article:
- Discusses the lemmatization task in Bangla and demonstrates how it differs from stemming
- Presents an efficient artificial neural network based model for lemmatization that performs better than existing ones
- Describes a new large dataset for lemmatization in the Bangla language
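The two metrics reported in the abstract, character accuracy and exact match, can be illustrated with a minimal sketch. The abstract does not define character accuracy, so the position-wise definition below (matching characters divided by the length of the longer string in each pair) is an assumption, and the function names and the transliterated toy words are hypothetical, not taken from the paper.

```python
def exact_match(predictions, references):
    """Fraction of predicted lemmas that equal the reference lemma exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)


def character_accuracy(predictions, references):
    """Position-wise character agreement over all pairs.

    Assumed definition: characters that match at the same index, divided by
    the length of the longer string in each pair, pooled over the dataset.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    matched = 0
    total = 0
    for pred, ref in zip(predictions, references):
        matched += sum(a == b for a, b in zip(pred, ref))
        total += max(len(pred), len(ref))
    return matched / total if total else 0.0


# Toy, transliterated placeholder data (hypothetical, not from the dataset).
predicted_lemmas = ["kora", "boi", "khela"]
reference_lemmas = ["kora", "boi", "khelo"]

em = exact_match(predicted_lemmas, reference_lemmas)          # 2 of 3 match
ca = character_accuracy(predicted_lemmas, reference_lemmas)   # 11 of 12 chars
print(f"exact match: {em:.4f}, character accuracy: {ca:.4f}")
```

Character accuracy gives partial credit to a near-miss prediction, while exact match counts it as wrong, which is consistent with the character-level figure being the higher of the two in the reported results.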

Full Text