Annotation and Classification of Three-Character Chinese Synthetic Words

Jia Lu, Masayuki Asahara, Yuji Matsumoto

doi:10.1142/s1793840608001846

Abstract

The lack of information about the internal structure of Chinese synthetic words has become a crucial problem for Chinese morphological analysis systems, which must accommodate the differing segmentation standards required by the higher-level NLP applications now being developed. In this paper, we first define the conceptual differences between Chinese single-morpheme words and Chinese synthetic words. We then divide Chinese synthetic words into two types, compound words and morphologically derived words, according to their internal syntactic and morphological structure, and classify them into more specific categories. After surveying three-character Chinese synthetic words based on these categories, we propose a tree-based analysis method to represent the internal information of the words. Next, we use machine learning methods to automatically identify the internal morphological structure of three-character synthetic words from a large corpus and add syntactic tags to that structure. We believe that tree-based word-internal information is useful in specifying a segmentation standard for Chinese synthetic words. We also believe that this information can help improve morphological analysis and out-of-vocabulary (OOV) word detection in Chinese text.
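As an illustration only, the idea of a tree over the characters of a three-character word can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the representation used in the paper: the `Node` class, the `compound` and `derived` labels, the `segment` helper, and the two example analyses (图书馆 "library" as [[图 书] 馆], 副总统 "vice president" as [副 [总 统]]) are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Node:
    """Binary node over characters. The label set is an assumption, not the paper's tag set."""
    label: str                   # e.g. "compound" or "derived" (hypothetical labels)
    left: Union["Node", str]     # a subtree or a single character
    right: Union["Node", str]


def surface(tree: Union[Node, str]) -> str:
    """Concatenate the leaves to recover the written word."""
    if isinstance(tree, str):
        return tree
    return surface(tree.left) + surface(tree.right)


def bracket(tree: Union[Node, str]) -> str:
    """Render the internal structure as a bracketed string."""
    if isinstance(tree, str):
        return tree
    return "[" + bracket(tree.left) + " " + bracket(tree.right) + "]"


def segment(tree: Union[Node, str], depth: int) -> List[str]:
    """Cut the tree at a given depth; depth 0 keeps the word whole."""
    if isinstance(tree, str) or depth == 0:
        return [surface(tree)]
    return segment(tree.left, depth - 1) + segment(tree.right, depth - 1)


# Hypothetical analyses of two three-character words.
library = Node("compound", Node("compound", "图", "书"), "馆")
vice_president = Node("derived", "副", Node("compound", "总", "统"))

print(bracket(library))             # bracketed structure of 图书馆
print(segment(library, 1))          # coarser segmentation
print(segment(vice_president, 1))   # prefix split off from its base
```

Cutting the same tree at different depths yields different granularities from a single annotation, which is one way word-internal structure could serve more than one segmentation standard.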

Full Text