Chinese Named Entity Recognition Method in History and Culture Field Based on BERT

Shuang Liu, Simon Kolmanič, Hui Yang, Jiayi Li

doi:10.1007/s44196-021-00019-8

Abstract

With the rapid development of the Internet, the way people obtain information has changed tremendously. In recent years, the knowledge graph has become a popular tool for the public to acquire knowledge. To build knowledge graphs of Chinese history and culture, most researchers have adopted traditional named entity recognition methods to extract entity information from unstructured historical text. However, traditional named entity recognition methods have certain defects and tend to ignore the associations between entities. To extract entities from large amounts of historical and cultural information more accurately and efficiently, this paper proposes a named entity recognition model that combines Bidirectional Encoder Representations from Transformers with a Bidirectional Long Short-Term Memory-Conditional Random Field network (BERT-BiLSTM-CRF). First, a BERT pre-trained language model encodes each character to obtain its vector representation. Then a Bidirectional Long Short-Term Memory (BiLSTM) layer semantically encodes the input text. Finally, the Conditional Random Field (CRF) layer outputs the most probable label sequence, which gives each character’s category. The model uses the BERT pre-trained language model in place of static word vectors trained in the traditional way. Unlike static vectors, BERT dynamically generates semantic vectors according to the context of each word, which improves the representational ability of the word vectors. The experimental results show that the proposed model achieves excellent results on named entity recognition in the field of history and culture. Compared with existing named entity recognition methods, precision, recall, and F_1 are significantly improved.
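The precision, recall, and F_1 values mentioned above are commonly computed at the entity level for named entity recognition. The following is a minimal sketch, assuming character-level BIO tags; the tagging scheme, the label names `PER` and `LOC`, and the function names are assumptions for illustration and are not taken from the paper.

```python
def extract_entities(tags):
    """Collect (start, end, type) spans from a BIO tag sequence (end is exclusive)."""
    entities = set()
    start, etype = None, None
    # A trailing "O" sentinel closes an entity that runs to the end of the sequence.
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def precision_recall_f1(gold_tags, pred_tags):
    """Entity-level scores: a prediction counts only if span and type both match."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical five-character example: the second entity is predicted too short.
gold = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
pred = ["B-PER", "I-PER", "O", "B-LOC", "O"]
print(precision_recall_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

Under this strict matching, the truncated `LOC` span counts as both a false positive and a false negative, which is why a partially correct entity lowers precision and recall at the same time.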

Highlights

With the rapid development of the Internet, people’s lifestyles and ways of understanding Chinese history and culture are changing
The baseline model used in this paper is the Bidirectional Long Short-Term Memory (BiLSTM)-Conditional Random Field (CRF) model, the most widely used model in named entity recognition applications
The spliced (concatenated) vector is fed into the BiLSTM; after training is completed, the BiLSTM output is passed to the CRF layer, which selects the optimal label sequence
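Selecting the optimal label sequence in a CRF layer is typically done with Viterbi decoding. The following is a minimal sketch, assuming the BiLSTM yields one emission score per label for each character and the CRF holds a label-to-label transition score matrix. The function name `viterbi_decode`, the label set, and all scores are hypothetical and not taken from the paper.

```python
def viterbi_decode(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions[t][j]: score of label j at position t (e.g. from the BiLSTM).
    transitions[i][j]: score of moving from label i to label j (CRF parameters).
    """
    if not emissions:
        return []
    n = len(labels)
    score = list(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n):
            # Best previous label for reaching label j at position t.
            best_i = max(range(n), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final label.
    best = max(range(n), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return [labels[j] for j in path]


# Hypothetical scores for a three-character input.
labels = ["O", "B-PER", "I-PER"]
transitions = [
    [0.0, 0.0, -10.0],  # from O: moving to I-PER is heavily penalized
    [0.0, -1.0, 1.0],   # from B-PER
    [0.0, 0.0, 0.5],    # from I-PER
]
emissions = [
    [0.1, 2.0, 0.5],
    [0.3, 0.2, 1.5],
    [2.0, 0.1, 0.3],
]
print(viterbi_decode(emissions, transitions, labels))  # ['B-PER', 'I-PER', 'O']
```

The transition scores are what let the CRF rule out invalid sequences such as `O` followed directly by `I-PER`, which a per-character argmax over the emission scores alone cannot do.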

Summary

Introduction

With the rapid development of the Internet, people’s lifestyles and ways of understanding Chinese history and culture are changing. More and more provinces in China are beginning to pay attention to the development of history and culture and to the construction of online cultural information platforms. With the emergence of more and more intelligent and digital museums, online information platforms for history and culture have attracted growing public attention. More and more scholars have begun to study this field [1]. Since China has five thousand years of history and culture, historical culture has become an indispensable part of our lives [2]. Facing massive online historical and cultural data, how to automatically extract potential knowledge from massive unstructured text data, how to extract the relevant content of historical culture and how to organize

Objectives

Results

Conclusion