Abstract

Turkish is an agglutinative language with rich morphology: a single Turkish verb can have thousands of different word forms. Sparsity therefore becomes an issue in many Turkish natural language processing (NLP) applications. This article presents a model for Turkish lexicon expansion. We expand the lexicon with a morphological segmentation system by reversing the segmentation task into a generation task. Our model uses finite-state automata (FSA) to incorporate orthographic features and morphotactic rules. We extract orthographic features by capturing the phonological operations that apply to a word whenever a suffix is added. Each FSA state corresponds to either a stem category or a suffix category. Stems are clustered by part of speech (i.e., noun, verb, or adjective), and suffixes are clustered by their allomorphic features. Using only a few thousand Turkish stems, we generated approximately 1 million word forms with an accuracy of 82.36%, which should help reduce the number of out-of-vocabulary words in other NLP applications. Although our experiments are performed on Turkish, the same model is also applicable to other agglutinative languages such as Hungarian and Finnish.
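The idea of states for stem and suffix categories, with allomorphs chosen by phonological context, can be illustrated with a toy sketch. This is not the authors' model: the state names, the two suffix categories (plural and locative), and the simplified vowel-harmony and voicing rules below are assumptions chosen for illustration, and real Turkish morphology involves many more operations.

```python
# Toy FSA-style generator for Turkish nouns (illustrative assumption,
# not the model described in the article).

BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")
VOICELESS = set("çfhkpsşt")


def last_vowel(word):
    """Return the last vowel of the word, which drives vowel harmony."""
    for ch in reversed(word):
        if ch in BACK_VOWELS or ch in FRONT_VOWELS:
            return ch
    raise ValueError(f"no vowel found in {word!r}")


def plural(word):
    """Attach the plural allomorph: -lar after back vowels, -ler otherwise."""
    return word + ("lar" if last_vowel(word) in BACK_VOWELS else "ler")


def locative(word):
    """Attach the locative allomorph: -da / -de / -ta / -te."""
    consonant = "t" if word[-1] in VOICELESS else "d"
    vowel = "a" if last_vowel(word) in BACK_VOWELS else "e"
    return word + consonant + vowel


# Hypothetical morphotactics: state -> [(suffix function, next state)].
TRANSITIONS = {
    "NOUN_STEM": [(plural, "PLURAL"), (locative, "LOCATIVE")],
    "PLURAL": [(locative, "LOCATIVE")],
    "LOCATIVE": [],
}


def generate(form, state="NOUN_STEM"):
    """Walk every path from the given state and collect the word forms."""
    forms = []
    for attach, next_state in TRANSITIONS[state]:
        new_form = attach(form)
        forms.append(new_form)
        forms.extend(generate(new_form, next_state))
    return forms


if __name__ == "__main__":
    for stem in ("ev", "kitap"):
        print(stem, "->", generate(stem))
```

Because the transition table allows the locative after the plural but not the reverse, the sketch never produces an invalid ordering of these two suffixes. In this toy version, morphotactic rules are simply the set of allowed transitions.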

Highlights

  • Morphological segmentation is the task of segmenting a word into its smallest meaning-bearing units, called morphemes, whereas morphological generation is the reverse task: generating various word forms from a given stem

  • We mainly investigate morphological generation, which plays a significant role in many natural language applications such as machine translation, question answering, language generation, and dialog systems

  • The number of generated words dropped when Morfessor CatMAP was used, because the morphological system undersplits the words and the finite-state automata (FSA) tend to be shallower than in the supervised setting

Summary

Introduction

Morphological segmentation is the task of segmenting a word into its smallest meaning-bearing units, called morphemes, whereas morphological generation is the reverse task: generating various word forms from a given stem. For example, segmentation splits the word disgraceful into dis, grace, and ful, while generation produces disgrace, disgraceful, disgracefully, ungraceful, ungracefully, and many others from the stem grace. Generation must also avoid producing invalid word forms such as gracelyful or disungraceful, which is the main challenge of the task. Translation of a phrase with a rich morphology requires the right word form to be
