Improving Chinese named entity recognition with lexical information

Guo-Hong Fu

doi:10.1109/icmlc.2009.5212793

Abstract

Named entity recognition (NER) plays a critical role in many natural language processing applications. Chinese NER is usually formalized as a chunking task, but most formulations do not distinguish named entities from common words, which makes it difficult to exploit lexical cues for NER. In this paper we propose a two-level IOB2 representation that merges lexical chunks and entity chunks, and we develop a morpheme-based chunking system for Chinese NER. The system works in three main steps. Given a plain Chinese sentence, a morpheme segmenter first segments it into a sequence of morphemes. A lexical chunker then tags each morpheme with a lexical chunk tag indicating its position in forming a word of a specific type. Finally, an entity chunker labels each morpheme with a hybrid chunk tag that also carries the related entity boundary and category information, if any. Our experiments on the IEER-99 and MET2 data show a significant improvement in NER performance when entity-internal part-of-speech information is used. We also show that the quality of lexical chunking matters for NER results.
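The two-level representation described above can be illustrated with a minimal sketch. It is not the paper's implementation: the tag labels (`nr`, `ns`, `n`, `v`, `PER`, `ORG`), the `/` separator used to join the two levels, the example sentence, and the helper name `decode_iob2` are all assumptions made for illustration. The sketch only shows how one IOB2 tag per morpheme at each level can be merged into a hybrid tag and decoded back into words and entities.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def decode_iob2(tags: List[str]) -> List[Span]:
    """Decode IOB2 tags (B-X opens a chunk, I-X continues it, O is outside)."""
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # The open chunk ends on "O", on any "B-", or on an "I-" with another label.
        ends_chunk = (
            tag == "O"
            or tag.startswith("B-")
            or (tag.startswith("I-") and tag[2:] != label)
        )
        if ends_chunk:
            if start is not None:
                spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


# Hypothetical output of a morpheme segmenter for one sentence.
morphemes = ["李", "明", "访问", "北京", "大学"]

# Level 1 (assumed labels): position of each morpheme within a word of some type.
lexical_tags = ["B-nr", "I-nr", "B-v", "B-ns", "B-n"]

# Level 2 (assumed labels): entity boundary and category, "O" outside entities.
entity_tags = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG"]

# Hybrid tag: both levels merged into one label per morpheme (assumed format).
hybrid = [f"{lex}/{ent}" for lex, ent in zip(lexical_tags, entity_tags)]

# Split the hybrid tags back into the two levels and decode each one.
lex_back = [h.split("/")[0] for h in hybrid]
ent_back = [h.split("/")[1] for h in hybrid]

words = decode_iob2(lex_back)
entities = decode_iob2(ent_back)
entity_strings = [("".join(morphemes[s:e]), label) for s, e, label in entities]
```

In this sketch the organization spans two lexical chunks (`ns` followed by `n`), so the hybrid tags expose the part-of-speech pattern inside the entity, which is the kind of entity-internal lexical cue the abstract refers to.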
