Abstract

Morphological analysis and disambiguation is an important task and a crucial preprocessing step in natural language processing of morphologically rich languages. Kinyarwanda, a morphologically rich language, currently lacks tools for automated morphological analysis. While linguistically curated finite-state tools can be easily developed for morphological analysis, the morphological richness of the language means that they produce many ambiguous analyses, which then require effective disambiguation. In this paper, we propose learning to morphologically disambiguate Kinyarwanda verbal forms from a new stemming dataset collected through crowd-sourcing. Using feature engineering and a feed-forward neural network classifier, we achieve about 89% non-contextualized disambiguation accuracy. Our experiments reveal that inflectional properties of stems and morpheme association rules are the most discriminative features for disambiguation.

Highlights

  • Morphological analysis and disambiguation plays a critical role in most natural language processing (NLP) tasks

  • When inflections are generated by piecing together multiple morphemes, a large and sparse vocabulary is produced, requiring tools to unpack the individual morphemes for downstream NLP tasks such as information extraction and machine translation

  • Research on NLP for low-resource languages lags behind recent advancements made for NLP on high-resource languages

Summary

Introduction

Morphological analysis and disambiguation plays a critical role in most natural language processing (NLP) tasks. While several morphologically rich languages such as Turkish, Arabic and Modern Hebrew already have mature tools for morphological segmentation (Çöltekin, 2010; Çöltekin, 2014; Itai and Segal, 2003; Habash and Rambow, 2006), Kinyarwanda still lacks appropriate tools for the task. A key obstacle is the need for high-quality datasets manually annotated by language experts. We leverage an easy-to-collect stemming dataset and transform it into a resource for morphological disambiguation. Collecting stemming data is much faster and less error-prone than collecting full morphological segmentations, which require subtle linguistic knowledge.
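
One plausible way to read the step of turning stemming data into a disambiguation resource is sketched below: each candidate analysis of a word is labeled by whether its stem agrees with the crowd-sourced stem. This is a minimal illustration under stated assumptions, not the paper's actual pipeline. The `Analysis` structure, the `label_candidates` helper, and the toy candidate segmentations of "turakora" are all hypothetical and are not the output of a real analyzer.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Analysis:
    """One candidate segmentation of a surface form (hypothetical structure)."""
    prefixes: Tuple[str, ...]
    stem: str
    suffixes: Tuple[str, ...]


def label_candidates(
    candidates: List[Analysis], annotated_stem: str
) -> List[Tuple[Analysis, int]]:
    """Assign label 1 to candidates whose stem matches the annotated stem, else 0."""
    return [(c, int(c.stem == annotated_stem)) for c in candidates]


# Toy candidates for the surface form "turakora" (illustrative only).
candidates = [
    Analysis(("tu", "ra"), "kor", ("a",)),
    Analysis(("tu",), "rakor", ("a",)),
    Analysis(("tu", "ra", "k"), "or", ("a",)),
]

# Assume a crowd worker annotated the stem of "turakora" as "kor".
labeled = label_candidates(candidates, "kor")

for analysis, label in labeled:
    segments = analysis.prefixes + (analysis.stem,) + analysis.suffixes
    print("-".join(segments), label)
```

Stem agreement alone may leave several candidates that share a stem but differ in their affix segmentation, so labels of this kind would serve as training signal for a classifier rather than as a complete disambiguation rule.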

