On the Role of Morphological Information for Contextual Lemmatization

Olia Toporkov, Rodrigo Agerri

doi:10.1162/coli_a_00497

Abstract

Lemmatization is a natural language processing (NLP) task that consists of producing, from a given inflected word, its canonical form or lemma. It is one of the basic tasks that facilitate downstream NLP applications, and is of particular importance for highly inflected languages. Given that the process of obtaining a lemma from an inflected word can be explained by looking at its morphosyntactic category, including fine-grained morphosyntactic information when training contextual lemmatizers has become common practice, without considering whether this is optimal in terms of downstream performance. To address this issue, in this article we empirically investigate the role of morphological information in developing contextual lemmatizers for six languages spanning a varied spectrum of morphological complexity: Basque, Turkish, Russian, Czech, Spanish, and English. Furthermore, and unlike the vast majority of previous work, we also evaluate lemmatizers in out-of-domain settings, which constitute, after all, their most common use case. The results of our study are rather surprising. It turns out that providing lemmatizers with fine-grained morphological features during training is not that beneficial, not even for agglutinative languages. In fact, modern contextual word representations seem to implicitly encode enough morphological information to obtain competitive contextual lemmatizers without seeing any explicit morphological signal. Moreover, our experiments suggest that the best out-of-domain lemmatizers are those using simple UPOS tags or those trained without morphology and, lastly, that current evaluation practices for lemmatization are not adequate to clearly discriminate between models.
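To make the task concrete, the following minimal sketch shows why the morphosyntactic category can matter when mapping an inflected form to its lemma. It is a toy lookup table, not the contextual lemmatizers studied in the article; the lexicon entries, the `TOY_LEXICON` name, and the `lemmatize` function are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Tuple

# Hypothetical toy lexicon: (inflected form, UPOS tag) -> lemma.
# The entries are invented for illustration and are not taken from the article.
TOY_LEXICON: Dict[Tuple[str, str], str] = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("leaves", "VERB"): "leave",
    ("leaves", "NOUN"): "leaf",
}


def lemmatize(form: str, upos: str) -> str:
    """Look up the lemma for a form given its UPOS tag.

    Falls back to the lowercased form when the pair is not in the toy lexicon.
    """
    key = (form.lower(), upos)
    return TOY_LEXICON.get(key, form.lower())


if __name__ == "__main__":
    # The same surface form maps to different lemmas depending on its category.
    print(lemmatize("saw", "VERB"))
    print(lemmatize("saw", "NOUN"))
```

In this sketch the UPOS tag is supplied by hand; a contextual lemmatizer would instead have to resolve the ambiguity from the surrounding sentence, with or without explicit morphological input.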


Journal: Computational Linguistics
Publication Date: Mar 1, 2024
Citations: 2
License type: CC BY-NC-ND 4.0


