Abstract

We propose a novel language-independent framework for inducing a collection of morphological inflection classes from a monolingual corpus of full-form words. Our approach involves two main stages. In the first stage, we generate a large data structure of candidate inflection classes and their interrelationships. In the second stage, search and filtering techniques are applied to this data structure to identify a select collection of true inflection classes of the language. We describe the basic methodology involved in both stages of our approach and present an evaluation of our baseline techniques applied to induction of major inflection classes of Spanish. The preliminary results on an initial training corpus already surpass an F1 of 0.5 against ideal Spanish inflectional morphology classes.

Highlights

  • Many natural language processing tasks, such as morphological analysis and parsing, have mature solutions when applied to resource-rich European and Asian languages

  • The novel proposal we bring to the table is a formalization of the full search space of all candidate inflection classes

  • When learning the morphology of a foreign language, it is common for a student to study tables of inflection classes


Summary

Introduction

Many natural language processing tasks, such as morphological analysis and parsing, have mature solutions when applied to resource-rich European and Asian languages. Addressing these same tasks in less-studied, low-density languages poses exciting challenges. While low-density languages abound, comparatively few financial resources are available to address their challenges. These considerations suggest developing systems to automatically induce solutions for NLP tasks in new languages. The AVENUE project (Lavie et al., 2003; Carbonell et al., 2002; Probst et al., 2002) at Carnegie Mellon University seeks to apply automatic induction methods to develop rule-based machine translation systems between pairs of languages where one of the languages is low-density and the other is resource-rich. All experiments detailed in this paper are over a Spanish newswire corpus of 40,011 tokens and 6,975 types.

Previous Work
Inflection Classes as Motivation
Empirical Inflection Classes
Candidate Inflection Class Search Space
Search
Vertical Only
Horizontal Blocking
Evaluation
Results and Error
Future Work

