Uyghur Word Segmentation using a Combination of Rules and Statistics

Huajian Xue, Yong Yang, Ronghui Zhang, Xiao Li, Turghun Osman

doi:10.4156/aiss.vol3.issue11.13

Abstract

The rich morphology of Uyghur produces a very large number of word forms and leads to high out-of-vocabulary (OOV) rates, which cause many errors in Uyghur natural language processing (NLP). Morphological word segmentation is an important component in overcoming this problem. This paper describes a set of morphological rules derived from an analysis of the general structure of Uyghur words and presents a partially supervised word segmentation method. In this method, a suffix corpus is used to generate all possible morphological segmentations of a word, and a MAP-based (maximum a posteriori) model selects the optimal segmentation among them. In addition, a cascaded language model is used to improve segmentation accuracy. A test set of 5000 words was collected and segmented by hand. Experimental results on this test set show that the proposed method is more effective.
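The two-step idea in the abstract (generate every candidate segmentation from a suffix inventory, then keep the most probable one) can be illustrated with a minimal sketch. This is not the authors' implementation: the suffix list, the probabilities, the unknown-stem floor, and the function names `candidates`, `log_score`, and `best_segmentation` are all hypothetical, the example word uses a Latin-script transliteration, and a simple unigram score stands in for the MAP-based model. The cascaded language model is not modeled here.

```python
import math

# Hypothetical toy suffix inventory (Latin-script transliteration).
# Illustrative only; not the suffix corpus used in the paper.
SUFFIXES = ("lar", "da", "ni")

# Hypothetical probabilities, invented for this example.
STEM_PROB = {"kitab": 0.05}
SUFFIX_PROB = {"lar": 0.2, "da": 0.1, "ni": 0.1}
UNKNOWN_STEM_PROB = 1e-6  # assumed floor for stems not in STEM_PROB


def candidates(word, suffixes=SUFFIXES, min_stem=2):
    """Return every (stem, [suffixes]) analysis obtained by peeling
    known suffixes off the right edge of the word."""
    results = [(word, [])]  # the unsegmented word is always a candidate
    for suf in suffixes:
        if word.endswith(suf) and len(word) - len(suf) >= min_stem:
            for stem, tail in candidates(word[:-len(suf)], suffixes, min_stem):
                results.append((stem, tail + [suf]))
    return results


def log_score(stem, tail):
    """Unigram log-probability of one analysis (a stand-in for a MAP score)."""
    score = math.log(STEM_PROB.get(stem, UNKNOWN_STEM_PROB))
    for suf in tail:
        score += math.log(SUFFIX_PROB[suf])
    return score


def best_segmentation(word):
    """Pick the candidate analysis with the highest score."""
    return max(candidates(word), key=lambda c: log_score(*c))


if __name__ == "__main__":
    for stem, tail in candidates("kitablarda"):
        print(stem, tail, round(log_score(stem, tail), 2))
    print("best:", best_segmentation("kitablarda"))
```

In this toy setting the word "kitablarda" has three candidate analyses, and the one with a known stem wins because unknown stems receive only the floor probability. A realistic scorer would condition each suffix on its context rather than treat the morphemes as independent.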
