Abstract

Chinese word embedding models capture Chinese semantics from the characters that compose words and from the internal features of those characters, such as radicals, components, strokes, structure, and pinyin. However, some of these features overlap, and most methods do not model their relevance to one another. Moreover, these methods represent words as point vectors, which cannot adequately capture the different aspects of a Chinese word's meaning. In this paper, we propose a Feature Subsequence based Probability Representation Model (FSPRM) for learning Chinese word embeddings. We first integrate the morphological and phonetic features of Chinese characters (stroke, structure, and pinyin) and learn their relevance by designing a feature subsequence, which captures relatively comprehensive semantics of Chinese words. We then propose a feature probability distribution, built on these three internal features, to capture the different aspects of a word's meaning, estimating its mean as the sum of the feature subsequence embeddings. Since Chinese words with similar features tend to have similar semantics, we map Chinese words to feature probability distributions and design a similarity-based objective that learns their semantics by predicting the contextual words of each target word. Extensive experiments on word analogy, word similarity, text classification, and named entity recognition tasks demonstrate that the proposed method outperforms most state-of-the-art approaches.
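As a hedged sketch of the representation described above (the notation here is ours, not taken from the paper): let $S(w)$ denote the feature subsequence of a word $w$, built from its stroke, structure, and pinyin features, and let $\mathbf{e}_s$ be the learned embedding of a subsequence element $s$. The probability representation can then be read as a distribution whose mean is estimated as the sum of the feature subsequence embeddings, for example a Gaussian

$$\mu_w = \sum_{s \in S(w)} \mathbf{e}_s, \qquad w \sim \mathcal{N}(\mu_w, \Sigma_w),$$

where $\Sigma_w$ is a covariance intended to capture the different aspects of the word's meaning. Under this reading, the similarity-based objective would score a target word $w$ against a context word $c$ with a similarity between their distributions, e.g. $\mathrm{sim}(w, c) = \mu_w^{\top}\mu_c$ or a kernel between $\mathcal{N}(\mu_w, \Sigma_w)$ and $\mathcal{N}(\mu_c, \Sigma_c)$, maximized over observed (target, context) pairs.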
