Abstract

Fully supervised neural approaches have achieved significant progress in the task of Chinese word segmentation (CWS). Nevertheless, the performance of supervised models tends to drop dramatically when they are applied to out-of-domain data. This degradation is caused by the distribution gap across domains and the out-of-vocabulary (OOV) problem. To alleviate these two issues simultaneously, this paper proposes to couple distant annotation and adversarial training for cross-domain CWS. For distant annotation, we rethink the essence of "Chinese words" and design an automatic distant annotation mechanism that does not need any supervision or pre-defined dictionaries from the target domain. The approach can effectively discover domain-specific words and distantly annotate the raw texts of the target domain. For adversarial training, we develop a sentence-level training procedure to reduce noise and make maximum use of the source-domain information. Experiments on multiple real-world datasets across various domains show the superiority and robustness of our model, which significantly outperforms previous state-of-the-art cross-domain CWS methods.

Highlights

  • Chinese is an ideographic language and lacks word delimiters between words in written sentences

  • Recent neural approaches (Cai and Zhao, 2016; Liu et al, 2016; Cai et al, 2017; Ma et al, 2018) have achieved significant progress on in-domain Chinese word segmentation (CWS) tasks, but they still suffer from the cross-domain issue when processing out-of-domain data

  • Source code and dataset will be available at https://github.com/Alibaba-NLP/DAAT-CWS

Summary

Introduction

Chinese is an ideographic language and lacks word delimiters between words in written sentences. Chinese word segmentation (CWS) is often regarded as a prerequisite to downstream tasks in Chinese natural language processing. This task is conventionally formalized as a character-based sequence tagging problem (Peng et al, 2004), where each character is assigned a specific label to denote its position in a word. With the development of deep learning techniques, recent years have seen increasing interest in applying neural network models to CWS (Cai and Zhao, 2016; Liu et al, 2016; Cai et al, 2017; Ma et al, 2018). These approaches have achieved significant progress on in-domain CWS tasks, but they still suffer from the cross-domain issue when processing out-of-domain data. For example, segmenters built on the newswire domain have very limited information for segmenting domain-specific words like “溶菌酶 (Lysozyme)”.
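
The character-based tagging formulation mentioned above can be illustrated with a minimal sketch. It assumes the widely used BMES label set (Begin, Middle, End, Single), which this excerpt does not name explicitly; the function names and the example sentence are hypothetical and are not taken from the paper or its released code.

```python
def words_to_bmes(words):
    """Convert a segmented sentence (list of words) into per-character tags.

    Assumes the BMES scheme: B = first character of a multi-character word,
    M = interior character, E = last character, S = single-character word.
    """
    tags = []
    for word in words:
        if not word:
            continue  # ignore empty strings
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def bmes_to_words(chars, tags):
    """Recover words from characters and their BMES tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at E or S
            words.append(current)
            current = ""
    if current:  # flush a trailing, unterminated word
        words.append(current)
    return words


# Illustrative sentence (made up for this sketch): "Lysozyme is a kind of enzyme"
segmented = ["溶菌酶", "是", "一种", "酶"]
tags = words_to_bmes(segmented)
print(list(zip("".join(segmented), tags)))
print(bmes_to_words("".join(segmented), tags))
```

Under this scheme a segmenter only has to predict one label per character, which is why an unseen domain-specific word such as “溶菌酶” is hard: the model must tag it B, M, E without ever having observed it as a unit.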
