Abstract

Chinese word embedding models capture Chinese semantics from the characters that make up a word and from internal character features such as radicals, components, strokes, structures and pinyin. However, some of these features overlap, and most methods ignore their relevance to one another. Moreover, such methods represent words as point vectors, which cannot adequately capture the different aspect semantics of Chinese words. In this paper, we propose a Feature Subsequence based Probability Representation Model (FSPRM) for learning Chinese word embeddings. We first integrate the morphological and phonetic features of Chinese characters (stroke, structure and pinyin) into a feature subsequence, which learns their relevance and captures relatively comprehensive semantics of Chinese words. We then introduce a feature probability distribution, built on these three internal features, to capture the different aspect meanings of Chinese words, estimating its mean as the sum of the feature subsequence embeddings. Since Chinese words with similar features tend to have similar semantics, we map each Chinese word to a feature probability distribution and design a similarity-based objective that learns word semantics by predicting the contextual words of the target word. Extensive experiments on word analogy, word similarity, text classification and named entity recognition tasks demonstrate that the proposed method outperforms most state-of-the-art approaches.
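
To make the pipeline concrete, the following is a minimal sketch of the idea as described above, not the authors' implementation: it assumes a shared vocabulary of feature n-grams drawn from a word's stroke, structure and pinyin subsequence, forms the mean of the word's distribution as the sum of those feature embeddings, and uses a skip-gram-style negative-sampling objective as a stand-in for the paper's similarity-based objective. The covariance of the distribution is omitted for brevity, and all names, sizes and the feature-id layout are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_FEATURES, N_WORDS = 50, 10_000, 5_000

# One embedding per feature n-gram from a word's stroke/structure/pinyin
# subsequence, plus one output ("context") vector per vocabulary word.
feature_vecs = rng.normal(scale=0.1, size=(N_FEATURES, DIM))
context_vecs = rng.normal(scale=0.1, size=(N_WORDS, DIM))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_mean(feature_ids):
    # Mean of the word's distribution: the sum of its
    # feature-subsequence embeddings.
    return feature_vecs[feature_ids].sum(axis=0)

def score(feature_ids, ctx_id):
    # Similarity between the word's distribution and a context vector;
    # with a point context this reduces to a mean-context dot product.
    return word_mean(feature_ids) @ context_vecs[ctx_id]

def loss(feature_ids, pos_ctx, neg_ctxs):
    # Negative-sampling objective: raise the similarity to an observed
    # context word, lower it for sampled negatives.
    total = -np.log(sigmoid(score(feature_ids, pos_ctx)))
    for n in neg_ctxs:
        total -= np.log(sigmoid(-score(feature_ids, n)))
    return total

# Toy usage: a word whose feature subsequence maps to ids [3, 17, 42].
print(loss(feature_ids=[3, 17, 42], pos_ctx=1, neg_ctxs=[2, 9]))
```

In a full model, gradients of this loss would update both the feature embeddings and the context vectors, so words sharing stroke, structure or pinyin features move toward similar distributions.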
