Abstract

The Chinese NER task consists of two steps: first determining entity boundaries and then labeling them. Previous work that incorporates related words from a pre-trained vocabulary into character-based models has been shown to be effective. However, a character can match many words in the vocabulary, and their meanings vary widely. It is unreasonable to concatenate all matched words into the character's representation without making semantic distinctions, because words with different semantics also have distinct vectors under distributed representations. Moreover, mutual information maximization (MIM) provides a unified way to characterize the correlation between embeddings of different granularities, and we find it can be used to enhance the features in our task. Consequently, this paper introduces a novel Chinese NER model named SSMI based on semantic similarity and MIM. We first match all potential word boundaries of the input characters against the pre-trained vocabulary, and employ BERT to segment the input sentence into segments containing these characters. After computing their cosine similarity, we obtain the word boundary with the highest similarity and the word group whose similarity scores exceed a specific threshold. We then concatenate the most relevant word boundaries with the character vectors. We further apply mutual information maximization to the group, character, and sentence representations, respectively. Finally, we feed the results of the above steps into our novel network. Results on four public Chinese NER datasets show that SSMI achieves state-of-the-art performance.
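The selection step described above, keeping the single most similar matched word plus all words above a similarity threshold, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the embedding vectors, the candidate word list, and the threshold value of 0.5 are all hypothetical placeholders.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two dense vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_words(char_vec, candidate_vecs, threshold=0.5):
    """Pick the best-matching word boundary and the word group.

    char_vec: embedding of the character's BERT segment (assumed given).
    candidate_vecs: dict mapping each matched vocabulary word to its
        pre-trained embedding (assumed given).
    Returns (best_word, word_group): the word with the highest cosine
    similarity, and all words whose similarity exceeds the threshold.
    """
    sims = {w: cosine_sim(char_vec, v) for w, v in candidate_vecs.items()}
    best_word = max(sims, key=sims.get)
    word_group = [w for w, s in sims.items() if s >= threshold]
    return best_word, word_group

# Toy example with 2-d vectors standing in for real embeddings:
char_vec = np.array([1.0, 0.0])
candidates = {
    "word_a": np.array([1.0, 0.0]),   # identical direction, sim = 1.0
    "word_b": np.array([0.0, 1.0]),   # orthogonal, sim = 0.0
    "word_c": np.array([0.6, 0.8]),   # partially aligned, sim = 0.6
}
best, group = select_words(char_vec, candidates, threshold=0.5)
```

In the full model, the selected `best` boundary is concatenated with the character vector, while the `group` feeds the group-level MIM objective.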
