Enhancing CRF-based Chinese Word Segmentation Using a Rapid and Effective Feature Template Selection Algorithm and Character Normalization

Yaxuan Ren, Dehua Li

doi:10.1109/icctec.2017.00112

Abstract

Conditional random fields (CRFs) are among the classic models for Chinese word segmentation (CWS). Deep neural networks (DNNs) have recently emerged as a research hotspot in natural language processing (NLP). However, studies applying DNNs to CWS have not yielded significant gains over CRF models, so improving CRFs for CWS remains a viable avenue for research. This paper proposes two methods to enhance CRF-based CWS. First, a rapid and effective sequential forward selection (SFS)-style method is used for feature template selection, balancing search performance against search speed. Second, a character normalization method is described that is more robust than the traditional method. Incremental evaluations on the second SIGHAN bakeoff show that the two proposed methods reduce the error by 7.8% and 10.6%, respectively, in terms of F-score. The final system achieved F-scores of 0.955 (AS), 0.955 (CITYU), 0.970 (MSR), and 0.952 (PKU), which are comparable to those of the best systems reported in the literature.
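
To illustrate the general idea behind sequential forward selection mentioned in the abstract, the following is a minimal sketch of textbook greedy SFS, not the specific SFS-style variant proposed in this paper. The function name `sequential_forward_selection`, the `min_gain` parameter, the template names, and the toy scoring function are all hypothetical; in a real setting the scoring function would presumably train a CRF with the given templates and return an F-score on held-out data.

```python
from typing import Callable, FrozenSet, List, Sequence, Tuple


def sequential_forward_selection(
    candidates: Sequence[str],
    score: Callable[[FrozenSet[str]], float],
    min_gain: float = 0.0,
) -> Tuple[List[str], float]:
    """Generic greedy SFS: repeatedly add the candidate that most improves the score."""
    selected: List[str] = []
    best_score = float("-inf")
    remaining = list(candidates)
    while remaining:
        # Score every one-template extension of the current selection.
        trials = [(score(frozenset(selected + [c])), c) for c in remaining]
        trial_score, trial_best = max(trials, key=lambda t: t[0])
        # Stop as soon as the best extension no longer improves the score.
        if trial_score - best_score <= min_gain:
            break
        selected.append(trial_best)
        remaining.remove(trial_best)
        best_score = trial_score
    return selected, best_score


def toy_score(templates: FrozenSet[str]) -> float:
    # Hypothetical stand-in for "train a CRF and measure F-score":
    # three templates help, every other template slightly hurts.
    useful = {"C0": 0.5, "C-1C0": 0.3, "C0C1": 0.1}
    return sum(useful.get(t, -0.05) for t in templates)


# Hypothetical candidate templates (character unigrams and bigrams by offset).
candidates = ["C-1", "C0", "C-1C0", "C0C1", "C1"]
selected, best = sequential_forward_selection(candidates, toy_score)
print(selected, round(best, 3))
```

Each round costs one evaluation per remaining candidate, so plain SFS needs on the order of n² evaluations for n templates; that cost is the search-speed concern the abstract refers to when each evaluation means training a model.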

Full Text