Learning Chinese Word Embeddings from Stroke, Structure and Pinyin of Characters

Yun Zhang, Shuangqing Zhai, Ziqiang Zheng, Weiguang Wang, Yongguo Liu, Zijie Chen, Jiajing Zhu, Xiaofeng Liu

doi:10.1145/3357384.3358005

Abstract

Chinese word embeddings have recently attracted much attention in natural language processing (NLP). Existing research learns Chinese word embeddings from characters, radicals, components and stroke n-grams. Beyond these features, Chinese characters also carry structure and pinyin information. In this paper, we design the feature substring, a superset of radicals, components and stroke n-grams that additionally encodes structure and pinyin information, to integrate the stroke, structure and pinyin features of Chinese characters and capture the semantics of Chinese words. Building on the feature substring, we propose a novel method, ssp2vec, which learns Chinese word embeddings by predicting contextual words from the feature substrings of the target words. The method is motivated by our observation that exploiting morphological information (stroke and structure) and phonetic information (pinyin) is crucial for capturing the meanings of Chinese words; the phonetic information additionally helps the model distinguish Chinese words. Experimental results on word analogy, word similarity, text classification and named entity recognition tasks show that the proposed method obtains better results than state-of-the-art approaches.
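As a rough illustration of the feature-substring idea described in the abstract, the sketch below enumerates contiguous n-grams over a sequence of per-word feature tokens. It is not the authors' implementation: the function name `feature_substrings`, the n-gram length bounds, and the token labels (a structure tag, stroke IDs, a pinyin tag) are all hypothetical placeholders, since the abstract does not specify how the features are encoded or ordered.

```python
from typing import List, Sequence, Tuple


def feature_substrings(
    tokens: Sequence[str], min_n: int = 3, max_n: int = 5
) -> List[Tuple[str, ...]]:
    """Return every contiguous n-gram of `tokens` with min_n <= n <= max_n.

    `tokens` stands in for a word's feature sequence; the real encoding
    used by ssp2vec is not given in the abstract, so this is an assumption.
    """
    grams: List[Tuple[str, ...]] = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the sequence.
        for start in range(len(tokens) - n + 1):
            grams.append(tuple(tokens[start:start + n]))
    return grams


# Hypothetical feature sequence: placeholder labels, not real stroke data.
example_tokens = ["struct:left-right", "s1", "s2", "s3", "pinyin:hao3"]
example_grams = feature_substrings(example_tokens, min_n=2, max_n=3)
```

In a model of this kind, each extracted substring would typically be assigned its own vector and the target word represented through those vectors when predicting context words; how ssp2vec combines them exactly is described in the full text, not here.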

Full Text