A Statistical Learning Approach to Improving the Accuracy of Chinese Word Segmentation

W.-K. Kan, C.-H. Leung

doi:10.1093/llc/11.2.87

Abstract

In Chinese, there is no delimiter separating successive words in a sentence. Chinese word segmentation, the process of identifying word boundaries in text, is therefore an essential step in Chinese language processing. Various word segmentation algorithms exist. However, because of the irregularities of syntactic and semantic features in Chinese, it is difficult to achieve a word segmentation accuracy of 100%. To address this problem, a statistical learning approach to improving the accuracy of Chinese word segmentation is proposed. Based on statistical correlations between incorrectly segmented strings and their contexts, a number of rules governing the modification of incorrectly segmented strings into correct ones are constructed. These rules can be applied to the output of an automatic word segmentation algorithm, modifying the segmentation results to make them more accurate.
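
The post-processing step described above can be illustrated with a minimal sketch. The abstract does not specify the form of the rules or how they are learned, so everything here is an assumption made for illustration: the `Rule` structure, the single-token left and right context, and the `apply_rules` function are hypothetical names, and the one rule shown is written by hand rather than derived from statistical correlations.

```python
from typing import List, NamedTuple, Optional, Tuple


class Rule(NamedTuple):
    """Hypothetical correction rule (format assumed, not taken from the paper)."""
    wrong: Tuple[str, ...]            # incorrectly segmented token sequence
    right: Tuple[str, ...]            # corrected token sequence
    left_ctx: Optional[str] = None    # required preceding token (None = any)
    right_ctx: Optional[str] = None   # required following token (None = any)


def apply_rules(tokens: List[str], rules: List[Rule]) -> List[str]:
    """Rewrite a segmenter's output left to right using the first matching rule.

    Context is checked against the original token list, not the rewritten one.
    """
    out: List[str] = []
    i = 0
    while i < len(tokens):
        for rule in rules:
            n = len(rule.wrong)
            if n == 0 or tuple(tokens[i:i + n]) != rule.wrong:
                continue
            prev_tok = tokens[i - 1] if i > 0 else None
            next_tok = tokens[i + n] if i + n < len(tokens) else None
            if rule.left_ctx is not None and rule.left_ctx != prev_tok:
                continue
            if rule.right_ctx is not None and rule.right_ctx != next_tok:
                continue
            out.extend(rule.right)   # replace the wrong span with the correction
            i += n
            break
        else:
            out.append(tokens[i])    # no rule matched at this position
            i += 1
    return out


# Hand-written example rule: "研究生 / 命" followed by "起源" becomes "研究 / 生命".
rules = [Rule(wrong=("研究生", "命"), right=("研究", "生命"), right_ctx="起源")]
print(apply_rules(["研究生", "命", "起源"], rules))  # ['研究', '生命', '起源']
```

Conditioning the rewrite on neighbouring tokens reflects the abstract's emphasis on context: the same token sequence is left unchanged when the required context is absent.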
