MACHINE LEARNING OF MORPHOSYNTACTIC STRUCTURE: LEMMATIZING UNKNOWN SLOVENE WORDS

Tomaž Erjavec, Sašo Džeroski

doi:10.1080/08839510490250088

Abstract

Automatic lemmatization is a core component of many language processing tasks. In inflectionally rich languages such as Slovene, assigning the correct lemma (base form) to each word in running text is not trivial: nouns, for instance, inflect for number and case, with a complex configuration of endings and stem modifications. The problem is especially difficult for unknown words, since their word-forms cannot be matched against a morphological lexicon. This paper discusses a machine learning approach to the automatic lemmatization of unknown words in Slovene texts. We decompose the problem of learning to lemmatize into two subproblems: learning to perform morphosyntactic tagging of words in a text, and learning to perform morphological analysis, which produces the lemma from the word-form given the correct morphosyntactic tag. A statistics-based trigram tagger is used to learn morphosyntactic tagging, and a first-order decision list learning system is used to learn rules for morphological analysis. We train the tagger on a manually annotated corpus of 100,000 running words. We train the analyzer on open-class inflecting Slovene words, namely nouns, adjectives, and main verbs, which together are characterized by more than 400 different morphosyntactic tags. The training set for the analyzer is a morphological lexicon containing 15,000 lemmas. We evaluate the learned model on word lists extracted from a corpus of Slovene texts containing 500,000 words, and show that the morphological analysis module achieves 98.6% accuracy, while the combination of tagger and analyzer is 92.0% accurate on unknown inflecting Slovene words.
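The two-stage decomposition described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' system: the rule table `RULES`, the functions `analyze`, `lemmatize`, and `toy_tagger`, and the tag strings are all hypothetical and hand-written for illustration. A dictionary lookup stands in for the trigram tagger, and a per-tag ordered list of suffix rewrites stands in for the learned first-order decision lists.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical, hand-written decision lists (NOT learned rules from the paper).
# For each morphosyntactic tag: ordered (suffix, replacement) pairs; first match wins.
RULES: Dict[str, List[Tuple[str, str]]] = {
    "Ncfsg": [("e", "a")],  # illustrative tag: feminine noun, singular genitive
    "Ncfsa": [("o", "a")],  # illustrative tag: feminine noun, singular accusative
    "Ncnsg": [("a", "o")],  # illustrative tag: neuter noun, singular genitive
}


def analyze(word_form: str, tag: str) -> str:
    """Stage 2: produce a lemma from a word-form, given its tag."""
    for suffix, replacement in RULES.get(tag, []):
        if word_form.endswith(suffix):
            stem = word_form[: len(word_form) - len(suffix)]
            return stem + replacement
    # No rule applies: fall back to the word-form itself.
    return word_form


def lemmatize(tokens: List[str],
              tagger: Callable[[List[str]], List[str]]) -> List[str]:
    """Stage 1 (tagging) followed by stage 2 (analysis)."""
    tags = tagger(tokens)
    return [analyze(word, tag) for word, tag in zip(tokens, tags)]


def toy_tagger(tokens: List[str]) -> List[str]:
    """Stand-in for a trained tagger: a fixed lookup, assumed for this demo."""
    lookup = {"mize": "Ncfsg", "mizo": "Ncfsa", "mesta": "Ncnsg"}
    return [lookup.get(token, "X") for token in tokens]


if __name__ == "__main__":
    print(lemmatize(["mize", "mizo", "mesta"], toy_tagger))
```

The sketch also shows why the analyzer's output depends on tagging quality: given a wrong tag, `analyze` either applies the wrong rewrite or falls back to the unchanged word-form, which is consistent with the combined accuracy reported above being lower than the analyzer's accuracy in isolation.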

Full Text