Abstract

Supervised word segmentation relies heavily on large-scale, high-quality labeled data. However, building such a corpus is difficult, especially for domain-specific data. In this paper, we propose a novel semi-supervised Chinese word segmentation (CWS) method. Specifically, we extend the training data with useful sentences selected from a large pool of unlabeled sentences, using a sampling strategy based on character-based semantic similarity. The proposed similarity algorithm scores each unlabeled sentence against the training data, and these scores guide the selection of helpful samples. In addition, we integrate an attention mechanism into our word segmentation model to focus on the available contextual information. Experiments on the PKU, MSR and Weibo benchmark data sets show that our method outperforms previous neural network models and state-of-the-art methods.
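
The sampling idea can be illustrated with a minimal sketch. The abstract does not specify the similarity algorithm, so the code below assumes a simple stand-in: each sentence is represented as a bag-of-characters count vector and compared to the pooled training data by cosine similarity. The function names (char_vector, cosine, select_samples) and the parameter k are hypothetical, chosen only for illustration, and a real system would likely use learned character embeddings instead of raw counts.

```python
import math
from collections import Counter


def char_vector(sentence):
    """Bag-of-characters counts (an assumed stand-in for character embeddings)."""
    return Counter(ch for ch in sentence if not ch.isspace())


def cosine(a, b):
    """Cosine similarity between two sparse character-count vectors."""
    dot = sum(a[ch] * b[ch] for ch in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_samples(unlabeled, training, k):
    """Return the k unlabeled sentences most similar to the training data."""
    # Pool all training sentences into a single character-count vector.
    train_vec = Counter()
    for sentence in training:
        train_vec.update(char_vector(sentence))
    # Score every unlabeled sentence against the pooled training vector.
    scored = [(cosine(char_vector(s), train_vec), s) for s in unlabeled]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]


training = ["我爱北京", "北京天气好"]
unlabeled = ["北京很大", "abc def", "天气好"]
selected = select_samples(unlabeled, training, 2)
```

In a semi-supervised setup of this kind, the selected sentences would then be labeled automatically by the current segmenter and added to the training set. That step is omitted here because the abstract does not describe it.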
