Abstract

Two-Level Morphology with Composition

Lauri Karttunen, Ronald M. Kaplan, and Annie Zaenen
Xerox Palo Alto Research Center
Center for the Study of Language and Information, Stanford University

1. Limitations of Kimmo systems

The advent of two-level morphology (Koskenniemi [1], Karttunen [2], Antworth [3], Ritchie et al. [4]) has made it relatively easy to develop adequate morphological (or at least morphographical) descriptions for natural languages, clearly superior to earlier cut-and-paste approaches to morphology. Most of the existing Kimmo systems developed within this paradigm consist of

• linked lexicons stored as annotated letter trees
• morphological information on the leaf nodes of trees
• transducers that encode morphological alternations

An analysis of an inflected word form is produced by mapping the input form to a sequence of lexical forms through the transducers and by composing some output from the annotations on the leaf nodes of the lexical paths that were traversed.

Comprehensive morphological descriptions of this type have been developed for several languages including Finnish, Swedish, Russian, English, Swahili, and Arabic. Although they have several good features, these Kimmo-systems also have some limitations. The ones we want to address in this paper are the following:

(1) Lexical representations tend to be arbitrary. Because it is difficult to write and test two-level systems that map between pairs of radically dissimilar forms, lexical representations in existing two-level analyzers tend to stay close to the surface forms. This is not a problem for morphologically simple languages like English because, for most words, inflected forms are very similar to the canonical dictionary entry. Except for a small number of irregular verbs and nouns, it is not difficult to create a two-level description for English in which lexical forms coincide with the canonical citation forms found in a dictionary.
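As an illustration of the architecture described above, the following is a minimal Python sketch of a Kimmo-style analyzer for a toy English fragment. It is not taken from any existing Kimmo implementation: the names `Node`, `insert`, `build_lexicons`, and `analyze`, the lexicon names `Root` and `VerbSuffix`, and the annotation strings are all hypothetical. The transducer component is reduced to the simplest possible assumption: every letter maps to itself, and the lexical boundary symbol `+` is realized as nothing on the surface.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a letter tree (hypothetical structure for this sketch)."""
    children: dict = field(default_factory=dict)  # letter -> Node
    # Leaf entries: (annotation, name of the continuation lexicon or None).
    leaves: list = field(default_factory=list)


def insert(root, form, annotation, continuation=None):
    """Store a lexical form letter by letter and annotate its final node."""
    node = root
    for ch in form:
        node = node.children.setdefault(ch, Node())
    node.leaves.append((annotation, continuation))


def build_lexicons():
    """Two linked lexicons: verb stems that continue into verb suffixes."""
    root = Node()
    insert(root, "walk", "walk V", "VerbSuffix")
    insert(root, "talk", "talk V", "VerbSuffix")
    suffix = Node()
    insert(suffix, "", "Base")
    insert(suffix, "+ed", "Past")
    insert(suffix, "+s", "3sg")
    return {"Root": root, "VerbSuffix": suffix}


def analyze(surface, lexicons, start="Root"):
    """Return (lexical form, composed annotations) for each analysis."""
    results = []

    def walk(node, pos, lexical, annotations):
        for annotation, continuation in node.leaves:
            path = annotations + [annotation]
            if continuation is None:
                # A final entry counts only if the whole input was consumed.
                if pos == len(surface):
                    results.append((lexical, " ".join(path)))
            else:
                # Follow the link into the next lexicon.
                walk(lexicons[continuation], pos, lexical, path)
        for ch, child in node.children.items():
            if ch == "+":
                # Assumed rule: lexical "+" corresponds to no surface symbol.
                walk(child, pos, lexical + ch, annotations)
            elif pos < len(surface) and surface[pos] == ch:
                # Assumed rule: every other symbol maps to itself.
                walk(child, pos + 1, lexical + ch, annotations)

    walk(lexicons[start], 0, "", [])
    return results


lexicons = build_lexicons()
print(analyze("walked", lexicons))
```

In this sketch the lexical form recovered for `walked` is `walk+ed`, which stays close to the surface form, while the readable analysis is assembled separately from the leaf annotations. A language with richer alternations would need real two-level rules in place of the identity mapping assumed here.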
However, current analyzers for morphologically more complex languages (Finnish and Russian, for example) are not as satisfying in this respect. In these systems, lexical forms typically contain diacritic markers and special symbols; they are not real words in the language. For example, in Finnish the lexical counterpart of otin 'I took' might be rendered as otTa1I1n, where T, a1, and I1 are an arbitrary encoding of morphological alternations that determine the allomorphs of the stem and the past tense morpheme. The canonical citation form ottaa 'to take' is composed from annotations on the leaf nodes of the letter trees that are linked to match the input. It is not in any direct way related to the lexical form produced by the transducers.

(2) Morphological categories are not directly encoded as part of the lexical form. Instead of morphemes like Plural or Past, we typically see suffix strings like +s and +ed, which do not by themselves indicate what morpheme they express. Different realizations of the same morphological category are often represented as different even on the lexical side.

These characteristics lead to some undesirable consequences:

ACTES DE COLING-92, NANTES, 23-28 AOÛT 1992    141    PROC. OF COLING-92, NANTES, AUG. 23-28, 1992
