Most Chinese word representation approaches mine the lexeme of Chinese words by predicting the target word or its contextual words. However, a Chinese word may carry multiple meanings, and such approaches may capture only one of them. In this paper, to better model the multiple connotations of Chinese words, we propose a Feature integRation pre-training based Gaussian Embedding Model (FRGEM) with three stages: feature-integration pre-training, relevancy Gaussian representation, and similarity-based training. In the first stage, internal sequences built from three inner-character features of Chinese words (stroke, structure, and pinyin) are generated to learn their semantic relevancy; we then combine BERT with these internal sequences to extract sentence-level relevance and integrate the internal and sentence features to learn pre-training representations. In the second stage, a relevancy Gaussian representation over the pre-training embeddings is proposed, which fuses inner-character features into a Gaussian by estimating its mean, so as to analyze the multiple implications of Chinese words. In the third stage, a similarity-based objective is proposed to distinguish true contextual words and learn the final Chinese word representations. Extensive experiments on word similarity, word analogy, named entity recognition, and text classification show that FRGEM outperforms most state-of-the-art algorithms.
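The second and third stages can be illustrated with a minimal sketch of Gaussian word embeddings. The abstract only states that the Gaussian's mean is estimated from inner-character feature embeddings, so the choices below are assumptions: a diagonal covariance taken from the per-dimension spread of the feature embeddings, and the expected likelihood kernel as the similarity score. This is not the paper's exact estimator or objective.

```python
import numpy as np

def gaussian_word(feature_embeddings):
    """Build a diagonal Gaussian for a word from the embeddings of its
    inner-character features (e.g. stroke/structure/pinyin sub-embeddings).
    Assumed estimator: mean = average feature embedding, variance =
    per-dimension spread across features."""
    feats = np.asarray(feature_embeddings, dtype=float)
    mu = feats.mean(axis=0)
    var = feats.var(axis=0) + 1e-6  # floor keeps the Gaussian non-degenerate
    return mu, var

def expected_likelihood(mu1, var1, mu2, var2):
    """Log expected likelihood kernel between two diagonal Gaussians,
    a standard similarity for Gaussian embeddings; words whose feature
    distributions overlap score higher."""
    var = var1 + var2
    diff = mu1 - mu2
    return -0.5 * (np.sum(np.log(2 * np.pi * var)) + np.sum(diff ** 2 / var))

# usage: a word scores higher against itself than against a distant word
mu_a, var_a = gaussian_word([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
mu_b, var_b = gaussian_word([[10.0, 10.0], [11.0, 11.0], [12.0, 12.0]])
sim_self = expected_likelihood(mu_a, var_a, mu_a, var_a)
sim_cross = expected_likelihood(mu_a, var_a, mu_b, var_b)
```

A similarity-based training objective, as in the third stage, would then push this score up for true contextual word pairs and down for sampled negatives.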