Abstract

Given the advantages and recent success of English character-level and subword-unit models in several NLP tasks, we consider the equivalent modeling problem for Chinese. Chinese script is logographic, and many Chinese logograms are composed of common substructures that provide semantic, phonetic and syntactic hints. In this work, we propose to explicitly incorporate the visual appearance of a character’s glyph in its representation, resulting in a novel glyph-aware embedding of Chinese characters. Inspired by the success of convolutional neural networks in computer vision, we use them to incorporate the spatio-structural patterns of Chinese glyphs as rendered in raw pixels. In the context of two basic Chinese NLP tasks, language modeling and word segmentation, the model learns to represent each character’s task-relevant semantic and syntactic information in the character-level embedding.

Highlights

  • In combination with deep learning, character-level and subword-unit-level models have achieved state-of-the-art performance in various natural language processing (NLP) tasks involving Western languages (Wu et al., 2016). We consider the equivalent modeling problem for solving NLP tasks in Chinese

  • We explore the effect of incorporating glyphs as additional features in the context of two common Chinese NLP tasks, segmentation and language modeling, resulting in a novel glyph-aware embedding of Chinese characters

  • In keeping with common neural network architectures, we feed the glyph as input to a feed-forward neural network (FNN) model, called an embedder, that outputs an embedding vector. In both the segmentation task and the language modeling task, this vector is consumed by a recurrent neural network to make predictions
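The embedder-then-recurrent-network pipeline described in the highlights can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the layer choices (one convolution, ReLU, global max pooling, a linear projection, a plain tanh recurrent step), all sizes, the random untrained weights, and the toy 8×8 bitmaps standing in for rendered glyphs are hypothetical, as are the names `conv2d_valid`, `embed_glyph`, `rnn_step`, and `random_matrix`.

```python
import math
import random


def conv2d_valid(image, kernel):
    """Slide a square kernel over a 2-D image (stride 1, no padding)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(k) for b in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def embed_glyph(glyph, kernels, projection):
    """Embedder sketch: convolution -> ReLU -> global max pool -> linear projection."""
    pooled = []
    for kernel in kernels:
        feature_map = conv2d_valid(glyph, kernel)
        # ReLU followed by a global max over the whole feature map.
        pooled.append(max(max(0.0, v) for row in feature_map for v in row))
    return [sum(w * p for w, p in zip(row, pooled)) for row in projection]


def rnn_step(hidden, embedding, w_hh, w_xh):
    """One step of a plain tanh recurrent cell that consumes a glyph embedding."""
    return [
        math.tanh(
            sum(w * h for w, h in zip(w_hh[i], hidden))
            + sum(w * x for w, x in zip(w_xh[i], embedding))
        )
        for i in range(len(hidden))
    ]


def random_matrix(rng, rows, cols):
    """Untrained random weights; a real model would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes, chosen only to keep the example small.
GLYPH_SIZE, KERNEL_SIZE, NUM_KERNELS, EMBED_DIM, HIDDEN_DIM = 8, 3, 4, 5, 6

rng = random.Random(0)
kernels = [random_matrix(rng, KERNEL_SIZE, KERNEL_SIZE) for _ in range(NUM_KERNELS)]
projection = random_matrix(rng, EMBED_DIM, NUM_KERNELS)
w_hh = random_matrix(rng, HIDDEN_DIM, HIDDEN_DIM)
w_xh = random_matrix(rng, HIDDEN_DIM, EMBED_DIM)

# Toy bitmaps standing in for glyphs rendered as raw pixels.
cross = [[1.0 if r == 3 or c == 3 else 0.0 for c in range(GLYPH_SIZE)]
         for r in range(GLYPH_SIZE)]
bar = [[1.0 if r == 3 else 0.0 for c in range(GLYPH_SIZE)]
       for r in range(GLYPH_SIZE)]

# Feed a short character sequence through the embedder and the recurrent cell.
hidden = [0.0] * HIDDEN_DIM
for glyph in (cross, bar, cross):
    embedding = embed_glyph(glyph, kernels, projection)
    hidden = rnn_step(hidden, embedding, w_hh, w_xh)
```

In a trained model, the final `hidden` state (or the per-step states) would feed a task-specific output layer, for example a per-character boundary label for segmentation or a next-character distribution for language modeling.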


Summary

Introduction

In combination with deep learning, character-level and subword-unit-level models have achieved state-of-the-art performance in various natural language processing (NLP) tasks involving Western languages (Wu et al., 2016). We consider the equivalent modeling problem for solving NLP tasks in Chinese. According to the Table of General Standard Characters (通用规范汉字表) compiled by the Chinese government in 2013, there are 3,500 level-1 (most common) characters and 8,105 characters in total (Wikipedia, 2017). It is not correct to treat Chinese characters as equivalent to English words, because the distribution of Chinese characters deviates markedly from Zipf’s law (Zipf, 1935; Shtrikman, 1994). There is, however, evidence that segmented Chinese words, some of which are unigrams (single characters), are distributed according to Zipf’s law (Xiao, 2008). The closest equivalent linguistic unit in English to a Chinese character is therefore a subword unit, i.e., a word fragment.
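Zipf’s law states that the frequency of the item at rank r is roughly proportional to r^(-s) with s close to 1, so a plot of log frequency against log rank is close to a straight line with slope near -1. One simple way to check a set of counts against it is a least-squares fit on the log-log scale. The sketch below assumes that approach; the helper name `zipf_fit` is hypothetical, and both count lists are synthetic, generated from formulas rather than taken from any corpus, so they say nothing about the actual distribution of Chinese characters.

```python
import math


def zipf_fit(counts):
    """Fit log(frequency) against log(rank); return (slope, r_squared)."""
    freqs = sorted((c for c in counts if c > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, r_squared


# Synthetic counts that follow Zipf's law exactly: f(r) = C / r.
zipf_counts = [1_000_000.0 / rank for rank in range(1, 1001)]

# Synthetic counts that decay exponentially with rank (illustrative contrast only).
decay_counts = [1_000_000.0 * math.exp(-rank / 100.0) for rank in range(1, 1001)]

zipf_slope, zipf_r2 = zipf_fit(zipf_counts)
decay_slope, decay_r2 = zipf_fit(decay_counts)

print(f"Zipf-like counts:  slope={zipf_slope:.3f}, R^2={zipf_r2:.3f}")
print(f"Exponential decay: slope={decay_slope:.3f}, R^2={decay_r2:.3f}")
```

A slope near -1 together with an R² near 1 indicates Zipf-like behavior. A noticeably lower R² means the log-log relationship is curved, which is the kind of deviation the paragraph above refers to. Applying the same function to real character or word counts from a corpus would be the natural next step.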

