Glyph-aware Embedding of Chinese Characters

Falcon Dai, Zheng Cai

doi:10.18653/v1/w17-4109

Abstract

Given the advantage and recent success of English character-level and subword-unit models in several NLP tasks, we consider the equivalent modeling problem for Chinese. Chinese script is logographic, and many Chinese logograms are composed of common substructures that provide semantic, phonetic and syntactic hints. In this work, we propose to explicitly incorporate the visual appearance of a character’s glyph in its representation, resulting in a novel glyph-aware embedding of Chinese characters. Inspired by the success of convolutional neural networks in computer vision, we use them to incorporate the spatio-structural patterns of Chinese glyphs as rendered in raw pixels. In the context of two basic Chinese NLP tasks, language modeling and word segmentation, the model learns to represent each character’s task-relevant semantic and syntactic information in the character-level embedding.

Highlights

In combination with deep learning, character-level and subword-unit-level models have achieved state-of-the-art performance in various natural language processing (NLP) tasks involving Western languages (Wu et al., 2016). We consider the equivalent modeling problem for solving NLP tasks in Chinese.
We explore the effect of incorporating glyphs as additional features in the context of two common Chinese NLP tasks, segmentation and language modeling, resulting in a novel glyph-aware embedding of Chinese characters.
In keeping with common neural network architectures, we feed the glyph as input to a feed-forward neural network (FNN) model, an embedder, that outputs an embedding vector. In both the segmentation task and the language modeling task, a recurrent neural network consumes this vector to make predictions.
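The glyph-to-embedding-to-RNN pipeline described in the highlights can be sketched roughly as follows. This is a minimal, untrained NumPy illustration and not the authors' implementation: the bitmap size, kernel size, filter count, embedding and hidden dimensions, the use of global max pooling, and the names `GlyphEmbedder`, `conv2d_valid` and `rnn_forward` are all assumptions made for the example, and random bitmaps stand in for rendered glyphs.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are assumptions for illustration, not values from the paper.
GLYPH_SIZE = 24   # side length of the glyph bitmap in pixels
KERNEL = 3        # convolution kernel size
N_FILTERS = 8     # number of convolution filters
EMBED_DIM = 16    # size of the character embedding
HIDDEN_DIM = 32   # size of the recurrent hidden state


def conv2d_valid(image, filters):
    """Valid 2-D cross-correlation of an (H, W) image with (F, k, k) filters."""
    k = filters.shape[-1]
    h, w = image.shape
    out = np.zeros((filters.shape[0], h - k + 1, w - k + 1))
    # Accumulate one kernel offset at a time instead of looping over pixels.
    for di in range(k):
        for dj in range(k):
            patch = image[di:di + h - k + 1, dj:dj + w - k + 1]
            out += filters[:, di, dj][:, None, None] * patch
    return out


class GlyphEmbedder:
    """Hypothetical embedder: convolution, ReLU, global max pool, linear map."""

    def __init__(self):
        self.filters = rng.normal(0.0, 0.1, (N_FILTERS, KERNEL, KERNEL))
        self.proj = rng.normal(0.0, 0.1, (N_FILTERS, EMBED_DIM))

    def __call__(self, glyph):
        features = np.maximum(conv2d_valid(glyph, self.filters), 0.0)  # ReLU
        pooled = features.max(axis=(1, 2))  # one value per filter
        return pooled @ self.proj           # shape (EMBED_DIM,)


def rnn_forward(embeddings, w_in, w_rec):
    """Plain tanh RNN that consumes one character embedding per step."""
    h = np.zeros(w_rec.shape[0])
    states = []
    for x in embeddings:
        h = np.tanh(x @ w_in + h @ w_rec)
        states.append(h)
    return np.stack(states)


# Random binary bitmaps stand in for five rendered character glyphs.
glyphs = (rng.random((5, GLYPH_SIZE, GLYPH_SIZE)) > 0.5).astype(float)

embedder = GlyphEmbedder()
embeddings = np.stack([embedder(g) for g in glyphs])

w_in = rng.normal(0.0, 0.1, (EMBED_DIM, HIDDEN_DIM))
w_rec = rng.normal(0.0, 0.1, (HIDDEN_DIM, HIDDEN_DIM))
states = rnn_forward(embeddings, w_in, w_rec)

print(embeddings.shape)  # (5, 16)
print(states.shape)      # (5, 32)
```

In a trained model, a task-specific output layer (for example, a per-character segmentation tag or a next-character distribution) would sit on top of the hidden states, and the embedder and RNN weights would be learned jointly.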

Summary

Introduction

In combination with deep learning, character-level and subword-unit-level models have achieved state-of-the-art performance in various natural language processing (NLP) tasks involving Western languages (Wu et al., 2016). We consider the equivalent modeling problem for solving NLP tasks in Chinese. According to the Table of General Standard Characters (通用规范汉字表) compiled by the Chinese government in 2013, there are 3,500 level-1 (most common) characters and more than 8,105 characters in total (Wikipedia, 2017). It is not correct to treat Chinese characters as equivalent to English words, because the distribution of Chinese characters deviates markedly from Zipf’s law (Zipf, 1935; Shtrikman, 1994). There is evidence suggesting that segmented Chinese words, some of which are unigrams, are distributed according to Zipf’s law (Xiao, 2008). The closest equivalent linguistic unit in English to a Chinese character is a subword unit, i.e., a word fragment.
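One rough way to check how closely a set of units follows Zipf’s law is to fit a line to log frequency against log rank; under Zipf’s law the slope is close to -1. The sketch below is an illustration of that check only, not the analysis used in the cited works. The function name `zipf_slope` and the synthetic token counts are assumptions made for the example.

```python
from collections import Counter
import math


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank).

    A slope near -1 is what Zipf's law predicts; a clearly different
    slope suggests the units are not Zipf-distributed.
    """
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct tokens")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic corpus whose counts are exactly proportional to 1 / rank.
synthetic_counts = {"a": 1200, "b": 600, "c": 400, "d": 300, "e": 240, "f": 200}
synthetic_tokens = [tok for tok, n in synthetic_counts.items() for _ in range(n)]

slope = zipf_slope(synthetic_tokens)
print(round(slope, 6))
```

To compare units, the same function could be applied once to a corpus split into individual characters and once to the same corpus split into segmented words.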

Publication Date: Jan 1, 2017
Citations: 46
License type: CC BY
