Abstract

Currently, the most widespread neural network architecture for training language models is BERT, which has led to improvements in various Natural Language Processing (NLP) tasks. In general, the larger the number of parameters in a BERT model, the better the results on these NLP tasks. Unfortunately, memory consumption and training duration increase drastically with the size of these models. In this article, we investigate various techniques for training smaller BERT models: we combine methods from other BERT variants, such as ALBERT and RoBERTa, as well as relative positional encoding. In addition, we propose two new fine-tuning modifications that lead to better performance: Class-Start-End tagging and a modified form of Linear Chain Conditional Random Fields. Furthermore, we introduce Whole-Word Attention, which reduces BERT's memory usage and yields a small performance gain compared to classical Multi-Head Attention. We evaluate these techniques on five public German Named Entity Recognition (NER) tasks, two of which are introduced in this article.
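One of the fine-tuning modifications named above builds on a Linear Chain Conditional Random Field placed on top of BERT's per-token outputs; the modified form itself is described in the full text. As a generic point of reference only, the sketch below shows standard Viterbi decoding for a linear-chain CRF in plain NumPy. The function name, the shapes, and the omission of start/end transitions are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list[int]:
        """Standard linear-chain CRF decoding (illustrative sketch).

        emissions:   (seq_len, num_tags) per-token scores, e.g. from a BERT token head.
        transitions: (num_tags, num_tags) score of moving from tag i to tag j.
        Returns the highest-scoring tag sequence.
        """
        seq_len, num_tags = emissions.shape
        # score[t, j] = best score of any tag path ending in tag j at position t
        score = np.full((seq_len, num_tags), -np.inf)
        backpointers = np.zeros((seq_len, num_tags), dtype=int)
        score[0] = emissions[0]

        for t in range(1, seq_len):
            # candidate[i, j] = score of ending in tag i at t-1, then moving to tag j at t
            candidate = score[t - 1][:, None] + transitions + emissions[t][None, :]
            backpointers[t] = candidate.argmax(axis=0)
            score[t] = candidate.max(axis=0)

        # Follow the back-pointers from the best final tag to recover the full path.
        best_path = [int(score[-1].argmax())]
        for t in range(seq_len - 1, 0, -1):
            best_path.append(int(backpointers[t, best_path[-1]]))
        return best_path[::-1]

During training, a CRF layer additionally requires the forward algorithm to compute the partition function for the loss; only the decoding step is sketched here.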

Highlights

  • Named Entity Recognition (NER) is a well-known task in the field of Natural Language Processing (NLP)

  • We investigate the influence of different Bidirectional Encoder Representations from Transformers (BERT) pre-training methods, such as the pre-training tasks, varying positional encodings, and adding Whole-Word Masking, on a total of five different NER datasets

  • We examine the influence of the different pre-training tasks (Masked Language Modeling (MLM), Next Sentence Prediction (NSP), and Sentence Order Prediction (SOP)), focusing on improving the training of BERT for German NER tasks; a minimal MLM example is sketched after this list
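As a rough illustration of the first of these pre-training tasks, the snippet below prepares MLM batches with the Hugging Face transformers library. The checkpoint name and the 15% masking probability are common defaults chosen for illustration, not necessarily the configuration used in the article.

    from transformers import AutoTokenizer, DataCollatorForLanguageModeling

    # Any German BERT checkpoint works here; "bert-base-german-cased" is only an example.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")

    # MLM: a fraction of (sub)word tokens is replaced by [MASK] (or a random/unchanged
    # token) and the model is trained to reconstruct the original tokens.
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15
    )

    encoded = tokenizer(
        ["Johann Wolfgang von Goethe wurde in Frankfurt geboren."],
        return_tensors="pt",
    )
    batch = collator([{k: v[0] for k, v in encoded.items()}])
    print(batch["input_ids"])  # some token ids randomly replaced by the [MASK] id
    print(batch["labels"])     # -100 everywhere except at the masked positions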



Introduction

NER is a well-known task in the field of NLP. The NEISS project [1], in which we work in close cooperation with Germanists, is devoted to automating diverse processes in the creation of digital editions. The best results for NER tasks have been achieved with Transformer-based [2] language models, such as Bidirectional Encoder Representations from Transformers (BERT) [3]. A BERT model is first pre-trained on large amounts of unlabeled text to obtain a robust language model and then fine-tuned for a downstream task. Pre-training is resource-intensive and takes a long time (several weeks). Online platforms, such as Hugging Face [7], offer a zoo of already pre-trained networks that can be used directly to train a downstream task. However, the available models are not always suitable for a certain task, such as NER in German, because they may have been pre-trained on a different domain (e.g., language, time epoch, or text style).
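To make the fine-tuning step concrete, the following sketch loads a publicly available pre-trained German BERT checkpoint from the Hugging Face hub and attaches a token-classification head for NER. The checkpoint name and the label set are placeholders chosen for illustration; they are not the models or datasets evaluated in this article.

    from transformers import AutoTokenizer, AutoModelForTokenClassification

    # Illustrative NER tag set and an example German checkpoint from the hub.
    labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
    tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
    model = AutoModelForTokenClassification.from_pretrained(
        "bert-base-german-cased",
        num_labels=len(labels),
        id2label=dict(enumerate(labels)),
        label2id={label: i for i, label in enumerate(labels)},
    )

    # The pre-trained encoder is reused as-is; the classification head is newly initialised.
    inputs = tokenizer("Theodor Fontane lebte in Neuruppin.", return_tensors="pt")
    logits = model(**inputs).logits  # shape: (1, num_subword_tokens, len(labels))

Fine-tuning then updates the randomly initialised classification head together with the encoder weights on labelled NER data, e.g., via the transformers Trainer API.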
