Stem-Affix based Uyghur Morphological Analyzer

Mijit Ablimit,Akbar Pattar,Askar Hamdulla,Tatsuya Kawahara

doi:10.14257/ijfgcn.2016.9.2.07

Abstract

Uyghur language is an agglutinative language in which words are derived from stems (or roots) by concatenating suffixes. This property makes a large number of combinations of morphemes, and greatly increases the word-vocabulary size, causing out-of-vocabulary (OOV) and data sparseness problems for statistical models. So words are split into certain sub-word units and applied to text and speech processing applications. Proper sub-word units not only provide high coverage and smaller lexicon size, but also provide semantic and syntactic information which is necessary for downstream applications. This paper discusses a general purpose morphological analyzer tool which can split a text of words into sequence of morphemes or syllables. Uyghur morpheme segmentation is a basic part of the comprehensive effort of the Uyghur language corpus compilation. As there are no delimiters for sub-word units, a supervised method, combined with certain rules and a statistical learning algorithm, is applied for morpheme segmentation. For phonetic units like syllable and phonemes, pure rule-based methods can extract with high accuracy. Most common and proper sub-words for various applications can be the linguistic morphemes for they provide linguistic information, high coverage, low lexicon size, and easily be restored to words. As the Uyghur language is written as pronounced, phonetic alterations of speech are openly expressed in text. This property makes many surface forms for a particular morpheme. A general purpose morphological analyzer must be able to analyze and export in both standard and surface forms. So the morphophonetic alterations like phonetic harmony, weakening, and morphological changes are summarized and learnt from training corpus. And a statistical model based morpheme segmentation tool is trained on the corpus of aligned word-morpheme sequences, and applied to predict possible morpheme sequences. For an open test set, with word coverage of 86.8% and morpheme coverage of 98.4%, the morpheme segmentation accuracy is 97.6%. This morpheme segmentation tool can output both on the standard forms and on the surface forms without costing segmentation accuracy. Furthermore, for various basic lexical units of word, morpheme, and syllable, the statistical properties are compared as a comprehensive effort of the Uyghur language corpus compilation.

Full Text