Developing an Urdu Lemmatizer Using a Dictionary-Based Lookup Approach

Saima Shaukat,Asmara Akram,Muhammad Asad

doi:10.3390/app13085103

Saima Shaukat, Asmara Akram + Show 1 more

Open Access

https://doi.org/10.3390/app13085103

Copy DOI

Journal: Applied sciences	Publication Date: Apr 19, 2023
Citations: 2	License type: CC BY 4.0

Affiliation: The University of Tokyo, University of Lahore

Abstract

Lemmatization aims at returning the root form of a word. The lemmatizer is envisioned as a vital instrument that can assist in many Natural Language Processing (NLP) tasks. These tasks include Information Retrieval, Word Sense Disambiguation, Machine Translation, Text Reuse, and Plagiarism Detection. Previous studies in the literature have focused on developing lemmatizers using rule-based approaches for English and other highly-resourced languages. However, there have been no thorough efforts for the development of a lemmatizer for most South Asian languages, specifically Urdu. Urdu is a morphologically rich language with many inflectional and derivational forms. This makes the development of an efficient Urdu lemmatizer a challenging task. A standardized lemmatizer would contribute towards establishing much-needed methodological resources for this low-resourced language, which are required to boost the performance of many Urdu NLP applications. This paper presents a lemmatization system for the Urdu language, based on a novel dictionary lookup approach. The contributions made through this research are the following: (1) the development of a large benchmark corpus for the Urdu language, (2) the exploration of the relationship between parts of speech tags and the lemmatizer, and (3) the development of standard approaches for an Urdu lemmatizer. Furthermore, we experimented with the impact of Part of Speech (PoS) on our proposed dictionary lookup approach. The empirical results showed that we achieved the best accuracy score of 76.44% through the proposed dictionary lookup approach.

Full Text