Chinese word segmentation (CWS) is an important task in Chinese NLP and an essential pre-processing step in building a word-root database for the security classification of power data, which spans domains such as laws and regulations and the power industry. Labeling a large training corpus for each domain is impractical, which makes it difficult for supervised statistical learning methods to perform CWS effectively. Therefore, a Chinese word segmentation approach based on a dictionary and a semi-supervised conditional random field (SS-CRF) is presented. First, a CRF model for CWS is trained with self-training and active learning algorithms and used to segment the text. Then dictionary features are introduced to correct the CRF-based segmentation result using the reverse maximum matching (RMM) algorithm. Experiments on a cross-domain segmentation task show that the proposed method effectively improves the domain-adaptive performance of CWS.
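The dictionary step can be illustrated with a minimal sketch of dictionary-based matching, taking RMM to mean reverse maximum matching, its usual sense in CWS work. This is not the authors' correction procedure, which the abstract does not specify; the function name `rmm_segment`, the `max_len` parameter, and the toy dictionary are assumptions made for illustration only.

```python
def rmm_segment(text, dictionary, max_len=4):
    """Segment `text` by scanning right to left and taking, at each step,
    the longest dictionary word that ends at the current position."""
    words = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end` first, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # A single character is always accepted as a fallback,
            # so the scan is guaranteed to make progress.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    words.reverse()  # words were collected from right to left
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
print(rmm_segment("研究生命的起源", toy_dictionary, max_len=3))
```

On this toy input the right-to-left scan yields 研究 / 生命 / 的 / 起源 and so avoids the spurious word 研究生 that a left-to-right longest match would pick first. In the approach described above, such dictionary matches would be used to adjust the CRF output rather than to segment the raw text on their own.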