Abstract

We introduce Morse, a recurrent encoder-decoder model that produces morphological analyses of each word in a sentence. The encoder turns the relevant information about the word and its context into a fixed-size vector representation, and the decoder generates the sequence of characters for the lemma followed by a sequence of individual morphological features. We show that generating morphological features individually rather than as a combined tag allows the model to handle rare or unseen tags and to outperform whole-tag models. In addition, generating morphological features as a sequence rather than, for example, an unordered set allows our model to produce an arbitrary number of features that represent multiple inflectional groups in morphologically complex languages. We obtain state-of-the-art results in nine languages of different morphological complexity under low-resource, high-resource, and transfer learning settings. We also introduce TrMor2018, a new high-accuracy Turkish morphology data set. Our Morse implementation and the TrMor2018 data set are available online to support future research: see https://github.com/ai-ku/Morse.jl for a Morse implementation in Julia/Knet (Yuret, 2016) and https://github.com/ai-ku/TrMor2018 for the new Turkish data set.
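To make the output format concrete, the following minimal Python sketch builds a decoder target as lemma characters followed by individual feature tokens, and contrasts a whole-tag vocabulary with a per-feature vocabulary. It is an illustration only, not the Morse implementation: the function names, the toy training set, and the Oflazer-style feature names (Noun, A3sg, Pnon, P3sg, Acc, Nom) are assumptions that do not come from the text above.

```python
from typing import List, Set, Tuple


def to_target_sequence(lemma: str, features: List[str]) -> List[str]:
    """Decoder target: the lemma's characters, then one token per feature."""
    return list(lemma) + features


def can_generate(features: List[str], feature_vocab: Set[str]) -> bool:
    """A feature-level decoder can emit a tag if every feature is in its vocabulary."""
    return all(f in feature_vocab for f in features)


# Hypothetical training analyses as (lemma, features); the feature names are assumed.
train: List[Tuple[str, List[str]]] = [
    ("masal", ["Noun", "A3sg", "Pnon", "Acc"]),
    ("masal", ["Noun", "A3sg", "P3sg", "Nom"]),
]

# A whole-tag model has one output symbol per combined tag seen in training.
whole_tag_vocab = {"+".join(feats) for _, feats in train}
# A feature-level model has one output symbol per individual feature.
feature_vocab = {f for _, feats in train for f in feats}

# A combination that never occurs as a whole tag in the toy training set.
unseen = ["Noun", "A3sg", "P3sg", "Acc"]
target = to_target_sequence("masal", unseen)

print(target)
print("+".join(unseen) in whole_tag_vocab)  # whole-tag model cannot output it
print(can_generate(unseen, feature_vocab))  # feature-level model can compose it
```

In this toy setting the combined tag is missing from the whole-tag vocabulary, yet each of its features was seen in some other tag, which is the intuition behind handling rare or unseen tags by generating features one at a time.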

Highlights

  • … possible morphological analyses: the accusative and possessive forms of the stem “masal” and the +With form of the stem “masa”, all expressed with the same surface form (Oflazer, 1994)

  • We have experimented with other input-output formats, as described in Section 5: we found that jointly producing the lemma and the morphological features is more difficult than producing only morphological features in low-resource settings but gives similar performance in high-resource settings

  • The results demonstrate that Morse, generating analyses with its sequence decoder, significantly outperforms the state of the art in low-resource, high-resource, and transfer-learning experiments


Summary

Introduction

… possible morphological analyses: the accusative and possessive forms of the stem “masal” (tale) and the +With form of the stem “masa” (table), all expressed with the same surface form (Oflazer, 1994). Oflazer et al. (1999) observe that words in Turkish can have dependencies to any one of the inflectional groups of a derived word: in “mavi masalı oda” (room with a blue table), the adjective “mavi” (blue) modifies the noun root “masa” (table) even though the final part of speech of “masalı” is an adjective. This dependency would be difficult to represent without a detailed analysis of morphology. Morse performs lemmatization and tagging jointly by default; we report on separating the two tasks.
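The notion of inflectional groups can be sketched in a few lines of Python. The example below splits a feature sequence into groups at a derivational-boundary token. The `^DB` marker and the feature names used for the +With reading of “masalı” are assumptions based on common Turkish annotation conventions; they are not given in the text above, and the helper name is hypothetical.

```python
from typing import List

DB = "^DB"  # assumed derivational-boundary token separating inflectional groups


def inflectional_groups(features: List[str]) -> List[List[str]]:
    """Split a flat feature sequence into inflectional groups at each boundary token."""
    groups: List[List[str]] = [[]]
    for feature in features:
        if feature == DB:
            groups.append([])  # a boundary starts a new inflectional group
        else:
            groups[-1].append(feature)
    return groups


# Assumed analysis for the +With reading of "masalı" (stem "masa", table).
masali_with = ["Noun", "A3sg", "Pnon", "Nom", DB, "Adj", "With"]
groups = inflectional_groups(masali_with)

# The first group is the noun root that "mavi" (blue) modifies;
# the last group carries the final part of speech of the word.
print(groups[0])
print(groups[-1])
```

Because the feature sequence can contain any number of such boundaries, a sequence decoder can represent words with one, two, or more inflectional groups without a fixed-size tag inventory.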
