SIGHAN Bakeoff Research Articles

Named entity recognition (NER) is an essential part of natural language processing tasks. Chinese NER task is different from the many European languages due to the lack of natural delimiters. Therefore, Chinese Word Segmentation (CWS) is usually regarded as the first step of processing Chinese NER. However, the word-based NER models relying on CWS are more vulnerable to incorrectly segmented entity boundaries and the presence of out-of-vocabulary (OOV) words. In this paper, we propose a novel character-based Gated Convolutional Recurrent neural network with Attention called GCRA for Chinese NER task. In particular, we introduce a hybrid convolutional neural network with gating filter mechanism to capture local context information and a highway neural network after LSTM to select characters of interest. The additional gated self-attention mechanism is used to capture the global dependencies from different multiple subspaces and arbitrary adjacent characters. We evaluate the performance of our proposed model on three datasets, including SIGHAN bakeoff 2006 MSRA, Chinese Resume, and Literature NER dataset. The experiment results show that our model outperforms other state-of-the-art models without relying on any external resources like lexicons and multi-task joint training.

Chinese text segmentation (CTS) is a fundamental step in building any Chinese or cross-language information retrieval system. This paper identifies and proposes solutions to two main challenges facing today’s CTS systems: segmenting words longer than the context window and identifying words not derived from affixation or composition. Our methods exploit unlabeled data, making them scalable at little extra cost. To tackle the first problem, we use a transductive learning approach to automatically construct a dictionary, and then refine it by improving its test set coverage while reducing its over-fitting tendency. In addition, we incorporate frequency information to discriminate overlapping matching words. For the second problem, we employ statistical association measures non-parametrically through a natural but novel feature representation scheme. To demonstrate the generality of our approach, we verify our system on the most reputable CTS evaluation standard – the SIGHAN bakeoff, which contains datasets in both traditional and simplified Chinese. These datasets are provided by representative academic or industrial research institutes. The experimental results show that with only training data and unlabeled test data and with no external dictionaries, our approach effectively overcomes the above-mentioned problems and reduces segmentation errors by an average of 27.8% compared with the traditional approach. Notably, our approach improves the recall of new words, the most informative words, by 4.7% on average. Also, our approach outperforms the best SIGHAN CTS system, which requires many external resources. Additional analysis shows that our approach has the potential to gain accuracy as the test data increases.

SIGHAN Bakeoff Research Articles

Articles published on SIGHAN Bakeoff

A parallel computing-based Deep Attention model for named entity recognition

LSTM-CRF Neural Network With Gated Self Attention for Chinese NER

A Hybrid Model for Chinese Spelling Check

A Chinese Named Entity Recognition System with Neural Networks

Combining Self Learning and Active Learning for Chinese Named Entity Recognition

Chinese text segmentation: A hybrid approach using transductive learning and statistical association measures

Chinese word segmentation as morpheme-based lexical chunking

Lead the way for us

Editage

Paperpal

R Discovery

Mind the Graph

SIGHAN Bakeoff Research Articles

Articles published on SIGHAN Bakeoff

A parallel computing-based Deep Attention model for named entity recognition

LSTM-CRF Neural Network With Gated Self Attention for Chinese NER

A Hybrid Model for Chinese Spelling Check

A Chinese Named Entity Recognition System with Neural Networks

Combining Self Learning and Active Learning for Chinese Named Entity Recognition

Chinese text segmentation: A hybrid approach using transductive learning and statistical association measures

Chinese word segmentation as morpheme-based lexical chunking