Abstract

Word embedding models map each word into a low-dimensional space using the distributional information of unlabeled words in a corpus, which improves the generalization ability of lexical features. However, their performance is limited by out-of-vocabulary (OOV) words, because the relevant information about an OOV word cannot be fully used to generate an accurate embedding for it. To process OOV words effectively, both morphological structure information and context information should be considered. In view of the characteristics of Chinese, we propose a Fusion Multi-feature Encoder Based on Attention (FMEBA) for processing Chinese OOV words; it exploits the radical features of characters and uses a character-level Transformer encoder to process character sequence information and context information. To evaluate our model, we conducted experiments on a professional dataset from the Chinese power domain. The experimental results show that, compared with other models, ours achieves the best performance. We conclude that our method is well suited to processing Chinese OOV words.
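To make the described architecture concrete, the following is a minimal PyTorch sketch of the kind of model the abstract outlines: per-character radical embeddings fused with the output of a character-level Transformer encoder via an attention layer, pooled into a single vector for the OOV word. This is an illustrative assumption, not the authors' implementation; the class name `FMEBASketch`, all dimensions, the choice of multi-head attention for fusion, and the mean-pooling readout are ours.

```python
import torch
import torch.nn as nn


class FMEBASketch(nn.Module):
    """Hypothetical sketch of an attention-based multi-feature encoder
    for Chinese OOV words. Character embeddings pass through a
    character-level Transformer encoder; radical embeddings are fused
    in via attention. All names and sizes are illustrative only."""

    def __init__(self, n_chars, n_radicals, d_model=128, n_heads=4, n_layers=2):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_model)
        self.radical_emb = nn.Embedding(n_radicals, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.char_encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Attention that lets character states attend to radical features.
        self.fusion_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, char_ids, radical_ids):
        # char_ids, radical_ids: (batch, seq_len), aligned per character.
        h_char = self.char_encoder(self.char_emb(char_ids))
        h_rad = self.radical_emb(radical_ids)
        fused, _ = self.fusion_attn(h_char, h_rad, h_rad)
        # Mean-pool over the character sequence to get one word vector.
        return fused.mean(dim=1)


# Usage: embed a two-character OOV word (toy vocabulary sizes).
model = FMEBASketch(n_chars=5000, n_radicals=300)
vec = model(torch.tensor([[10, 42]]), torch.tensor([[3, 7]]))
print(vec.shape)  # torch.Size([1, 128])
```

The design point the sketch captures is that radicals carry morphological cues a pure character model misses, so the fusion attention gives each character state access to its radical feature before pooling; how the paper actually combines these signals may differ.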
