Abstract
Traditional named entity recognition methods rely mainly on hand-crafted features. With the rise of deep learning, neural networks have been introduced to capture deep features for named entity recognition. However, most existing methods target only modern corpora. Named entity recognition in ancient literature is challenging because the names in it have evolved over time. In this paper, we attempt to recognise entities by exploring the characteristics of characters and strokes. An enhanced character embedding model, named ECEM, is proposed on the basis of Bidirectional Encoder Representations from Transformers (BERT) and strokes. First, ECEM generates semantic vectors dynamically according to the context of the words. Second, it introduces morphological-level information about Chinese words. Finally, the enhanced character embedding is fed into a bidirectional long short-term memory-conditional random field (BiLSTM-CRF) model for training. To evaluate the proposed algorithm, experiments are carried out on both ancient literature and a modern corpus. The results indicate that our algorithm is effective compared with traditional ones.
Highlights
Because of the popularity of the web, a great many unstructured texts have emerged to represent web contents
Numerous machine learning approaches have been carefully studied for the named entity recognition (NER) task, including Conditional Random Fields (CRFs), Support Vector Machines (SVMs) and Hidden Markov Models (HMMs).[9]
ECEM is applied for the first time to named entity recognition in ancient literature; it captures context information and abundant knowledge by fine-tuning Bidirectional Encoder Representations from Transformers (BERT), and flexibly acquires morphological information generated through strokes
Summary
Because of the popularity of the web, a great many unstructured texts have emerged to represent web contents. Although word-based algorithms have achieved some success in Chinese NER,[11,15] many challenges remain. This is because the names of people, places and organisations keep increasing without a uniform naming rule, and ambiguity is inherent in the Chinese language. The precision is high, but recall is low. In response to these challenges, an enhanced character embedding algorithm, named ECEM, is proposed, in which BERT and strokes are integrated to learn the character representation and explore the performance in the Chinese NER domain. ECEM is applied for the first time to named entity recognition in ancient literature; it captures context information and abundant knowledge by fine-tuning BERT, and flexibly acquires morphological information generated through strokes.
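The idea of enriching a contextual character vector with stroke-level information can be illustrated with a small sketch. Everything below is an assumption made for illustration, not the ECEM implementation: the five-class stroke encoding, the tiny `STROKES` table, the n-gram sizes, the pseudo-random n-gram vectors (standing in for learned ones), and the use of simple concatenation to combine the two vectors. The `contextual` argument stands in for a vector that a fine-tuned BERT model would produce for the character in context.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Hypothetical stroke encoding (an assumption, not taken from the paper):
# 1 = horizontal, 2 = vertical, 3 = left-falling, 4 = right-falling/dot, 5 = turning.
STROKES: Dict[str, List[int]] = {
    "大": [1, 3, 4],
    "木": [1, 2, 3, 4],
}


def stroke_ngrams(strokes: Sequence[int], n_min: int = 2, n_max: int = 3) -> List[Tuple[int, ...]]:
    """Return all contiguous stroke n-grams with n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(strokes) - n + 1):
            grams.append(tuple(strokes[i:i + n]))
    return grams


def ngram_vector(gram: Tuple[int, ...], dim: int) -> List[float]:
    """Deterministic stand-in for a learned n-gram embedding."""
    rng = random.Random(str(gram))  # string seed keeps the vector reproducible
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def stroke_embedding(char: str, dim: int = 4) -> List[float]:
    """Average the n-gram vectors of a character; zeros if strokes are unknown."""
    grams = stroke_ngrams(STROKES.get(char, []))
    if not grams:
        return [0.0] * dim
    vecs = [ngram_vector(g, dim) for g in grams]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def enhance(contextual: Sequence[float], char: str, stroke_dim: int = 4) -> List[float]:
    """Concatenate a contextual vector with the stroke embedding (assumed fusion)."""
    return list(contextual) + stroke_embedding(char, stroke_dim)
```

In a full pipeline, the enhanced vectors for every character in a sentence would form the input sequence to the BiLSTM-CRF tagger; here, `enhance([0.1, 0.2, 0.3], "大")` simply yields a 7-dimensional vector whose first three entries are the contextual part.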