Abstract

Chinese word segmentation plays an important role in Chinese text mining. It is the foundation of automatic relation extraction and identification in Chinese information processing. In this paper, we propose a method for Chinese word segmentation based on conditional random fields (CRF) with character clustering. For the character clustering, we first use the Skip-Gram model to obtain character embeddings from a raw corpus (without word delimiters). We then apply two different clustering algorithms, K-means and Brown clustering, to group the character embeddings into clusters. We study the effects of the embedding dimensionality, the number of clusters, and the choice of clustering algorithm. We evaluate our method on the Weibo text segmentation task of the 4th CCF Conference on Natural Language Processing and Chinese Computing (NLPCC2015). Our system achieves an F-score of 95.67% and an out-of-vocabulary (OOV) rate of 94.78%. The results show that character clusters derived from character embeddings can improve the performance of Chinese word segmentation on short text.
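As a rough illustration of the clustering step described above, the following minimal sketch runs a plain K-means over character vectors and turns the resulting cluster IDs into per-character features of the kind a CRF tagger could consume. It is not the paper's implementation: the two-dimensional vectors in `EMBEDDINGS` are hand-made stand-ins for Skip-Gram output, and the feature names (`cluster`, `prev_cluster`, `next_cluster`) and the `-1` ID for unseen characters are assumptions chosen for illustration. Brown clustering and CRF training itself are not shown.

```python
import random

# Hypothetical toy vectors standing in for Skip-Gram character embeddings.
EMBEDDINGS = {
    "我": [0.90, 0.10], "你": [0.80, 0.20], "他": [0.85, 0.15],
    "吃": [0.10, 0.90], "喝": [0.20, 0.80],
}

def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns one cluster ID per input vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, v in enumerate(vectors):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def char_features(sentence, cluster_of):
    """Per-character feature dicts; unseen characters get cluster ID -1."""
    feats = []
    for i, ch in enumerate(sentence):
        f = {"char": ch, "cluster": cluster_of.get(ch, -1)}
        if i > 0:
            f["prev_cluster"] = cluster_of.get(sentence[i - 1], -1)
        if i + 1 < len(sentence):
            f["next_cluster"] = cluster_of.get(sentence[i + 1], -1)
        feats.append(f)
    return feats

chars = list(EMBEDDINGS)
assign = kmeans([EMBEDDINGS[c] for c in chars], k=2)
cluster_of = dict(zip(chars, assign))
features = char_features("我吃饭", cluster_of)
```

Each dict in `features` describes one character position. In a full system these dicts would be paired with segmentation tags and passed to a CRF trainer, and the number of clusters `k` would be tuned, since the abstract reports that it affects performance.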
