Abstract

Word segmentation is an essential and challenging task in natural language processing, especially for Chinese because of its high linguistic complexity. Existing methods for Chinese word segmentation, including statistical machine learning methods and neural network methods, usually perform well within specific knowledge domains. As interdisciplinary and cross-domain studies grow in importance, a key challenge in cross-domain word segmentation is handling out-of-vocabulary (OOV) words, and existing methods do not yet meet practical standards on this problem. To this end, we propose a document-level context-aware model that can automatically perceive and identify OOV words from different domains. Our method jointly applies a word-based model and a character-based model and then processes their results with a newly proposed reconstruction model. We evaluate the new method through comprehensive experiments on two real-world datasets (e.g., news from different domains). The results demonstrate the superiority of our method over state-of-the-art models in handling texts from different domains. Importantly, in the cross-domain scenario, our proposed method improves the recognition of OOV words.
