Abstract

The automated learning of the morphological features of highly agglutinative languages is an important research area for both machine learning and computational linguistics. In this paper we present a novel morphology model that solves the inflection generation and morphological analysis problems while handling all the affix types of the target language. The proposed model is trained on (word, lemma, morphosyntactic tags) triples. From this training data, it deduces word pairs for each affix type of the target language and learns the transformation rules of these affix types using our previously published, lower-level morphology model called ASTRA. Since ASTRA can only handle a single affix type, a separate model instance is built for every affix type of the target language. Besides learning the transformation rules of all the necessary affix types, the proposed model also calculates the conditional probabilities of the affix type chains using relative frequencies, and stores the valid lemmas and their parts of speech. With this information, it can generate the inflected form of an input lemma for a given set of affix types, and analyze inflected input word forms. For evaluation, we use Hungarian data sets and compare the accuracy of the proposed model with that of state-of-the-art morphology models published by SIGMORPHON, including the Helsinki (2016), UF and UTNII (2017), and Hamburg, IITBHU and MSU (2018) models. The test results show that with a training data set of up to 100 thousand random training items, our proposed model outperforms all the other examined models, reaching an accuracy of 98% on random input words that were not part of the training data. Using the high-resource data sets for the Hungarian language published by SIGMORPHON, the proposed model achieves an accuracy of about 95-98%.
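The abstract does not spell out how word pairs are deduced from the training triples, so the following is only a minimal sketch under stated assumptions: the tag list starts with the part of speech, the remaining tags are affix types ordered from innermost to outermost, and a pair is emitted only when the form with one fewer affix type is also present in the data. The function name `deduce_word_pairs` and the tag labels `N`, `PL` and `ACC` are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def deduce_word_pairs(triples):
    """Group (base form, inflected form) pairs by the affix type separating them.

    Assumption: tags[0] is the part of speech and tags[1:] are affix types,
    ordered so that the last tag is the outermost affix.
    """
    # Index every known word form by its lemma and full tag sequence.
    forms = {}
    for word, lemma, tags in triples:
        forms[(lemma, tuple(tags))] = word

    pairs = defaultdict(list)
    for (lemma, tags), word in forms.items():
        if len(tags) < 2:
            continue  # only a part of speech, so no affix to learn from
        # Look up the same lemma with the outermost affix type removed.
        base = forms.get((lemma, tags[:-1]))
        if base is not None:
            pairs[tags[-1]].append((base, word))
    return dict(pairs)


# Toy Hungarian data: "ház" (house) with plural and accusative forms.
triples = [
    ("ház", "ház", ["N"]),
    ("házak", "ház", ["N", "PL"]),
    ("házat", "ház", ["N", "ACC"]),
    ("házakat", "ház", ["N", "PL", "ACC"]),
]
word_pairs = deduce_word_pairs(triples)
```

In this sketch each affix type ends up with its own list of pairs, which is the kind of input a single-affix model such as ASTRA could then be trained on, one instance per affix type.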

Highlights

  • According to the theory of morphology and computational linguistics, words are built up from morphemes, which are the smallest morphological units with associated meaning [1]

  • In this paper we presented a novel multi-affix morphology model that can learn the morphology of highly agglutinative languages like Hungarian, and solve the inflection generation and morphological analysis problems, managing all the affix types of the target language

  • The proposed model calculates the conditional probability of all the possible affix type chains, stores the valid lemmas and their parts of speech, and trains a separate ASTRA model instance for each affix type, using a deduced set of word pairs demonstrating the transformation rules of the target affix type
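The relative-frequency estimate of affix type chain probabilities can be illustrated with a small sketch. The exact conditioning used by the proposed model is not stated here, so this example assumes a simple first-order scheme in which each affix type is conditioned on the tag that precedes it in the chain, starting from the part of speech. The function name `chain_probabilities`, the end marker `<END>` and the tag labels are hypothetical.

```python
from collections import Counter, defaultdict


def chain_probabilities(triples):
    """Estimate P(next tag | previous tag) by relative frequency.

    Assumption: tags[0] is the part of speech and tags[1:] are the affix
    types in the order in which they are attached to the lemma.
    """
    pair_counts = defaultdict(Counter)
    for _word, _lemma, tags in triples:
        # Append an end marker so that chain termination is counted too.
        chain = list(tags) + ["<END>"]
        for prev, nxt in zip(chain, chain[1:]):
            pair_counts[prev][nxt] += 1

    probs = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        # Relative frequency: count of (prev, nxt) over all counts for prev.
        probs[prev] = {nxt: n / total for nxt, n in counter.items()}
    return probs


# Toy Hungarian data: "ház" (house) with plural and accusative forms.
triples = [
    ("ház", "ház", ["N"]),
    ("házak", "ház", ["N", "PL"]),
    ("házat", "ház", ["N", "ACC"]),
    ("házakat", "ház", ["N", "PL", "ACC"]),
]
probs = chain_probabilities(triples)
```

On this toy data, `probs["N"]["PL"]` is 0.5 because two of the four noun chains continue with the plural. A model could multiply such conditional probabilities along a candidate chain to rank competing analyses of the same word form, although the source does not confirm that this is how the scores are combined.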


Summary

Introduction

According to the theory of morphology and computational linguistics, words are built up from morphemes, which are the smallest morphological units with associated meaning [1]. The grammatically correct root form of a word is called the lemma, while the added morphemes that modify its base meaning are called affixes. Affixes may also change some of the characters in the root form, resulting in, for example, vowel or consonant gradation. The process of adding affixes to a word is called inflection, while the inverse operation, in which we determine the lemma and the affixes of a word, is called morphological analysis. Natural languages have a finite number of affix types that determine the semantic meaning of the affixes, i.e. how they alter the meaning of the base form. Examples of affix types include accusative case, plural form and past tense. Affixes are the concrete realizations of affix types in words.
