Abstract

Currently, the most widespread neural network architecture for training language models is BERT, which has led to improvements in various Natural Language Processing (NLP) tasks. In general, the larger the number of parameters in a BERT model, the better the results on these NLP tasks. Unfortunately, memory consumption and training duration increase drastically with the size of these models. In this article, we investigate various techniques for training smaller BERT models: we combine methods from other BERT variants, such as ALBERT and RoBERTa, as well as relative positional encoding. In addition, we propose two new fine-tuning modifications that lead to better performance: Class-Start-End tagging and a modified form of Linear Chain Conditional Random Fields. Furthermore, we introduce Whole-Word Attention, which reduces BERT's memory usage and yields a small performance gain compared to classical Multi-Head Attention. We evaluate these techniques on five public German Named Entity Recognition (NER) tasks, two of which are introduced in this article.
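One of the fine-tuning modifications named above builds on a Linear Chain Conditional Random Field placed on top of BERT's per-token outputs; the modified form itself is described in the full text. As a generic point of reference only, the sketch below shows standard Viterbi decoding for a linear-chain CRF in plain NumPy. The function name, the shapes, and the omission of start/end transitions are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list[int]:
        """Standard linear-chain CRF decoding (illustrative sketch).

        emissions:   (seq_len, num_tags) per-token scores, e.g. from a BERT token head.
        transitions: (num_tags, num_tags) score of moving from tag i to tag j.
        Returns the highest-scoring tag sequence.
        """
        seq_len, num_tags = emissions.shape
        # score[t, j] = best score of any tag path ending in tag j at position t
        score = np.full((seq_len, num_tags), -np.inf)
        backpointers = np.zeros((seq_len, num_tags), dtype=int)
        score[0] = emissions[0]

        for t in range(1, seq_len):
            # candidate[i, j] = score of ending in tag i at t-1, then moving to tag j at t
            candidate = score[t - 1][:, None] + transitions + emissions[t][None, :]
            backpointers[t] = candidate.argmax(axis=0)
            score[t] = candidate.max(axis=0)

        # Follow the back-pointers from the best final tag to recover the full path.
        best_path = [int(score[-1].argmax())]
        for t in range(seq_len - 1, 0, -1):
            best_path.append(int(backpointers[t, best_path[-1]]))
        return best_path[::-1]

During training, a CRF layer additionally requires the forward algorithm to compute the partition function for the loss; only the decoding step is sketched here.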

Highlights

  • Named Entity Recognition (NER) is a well-known task in the field of Natural Language Processing (NLP)

  • We investigate the influence of different Bidirectional Encoder Representations from Transformers (BERT) pre-training methods, such as the pre-training tasks, varying positional encodings, and adding Whole-Word Masking, on a total of five different NER datasets

  • We examine the influence of the different pre-training tasks (Masked Language Modeling (MLM), Next Sentence Prediction (NSP), and Sentence Order Prediction (SOP)), focusing on improving the training of BERT for German NER tasks; a minimal MLM example is sketched after this list
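As a rough illustration of the first of these pre-training tasks, the snippet below prepares MLM batches with the Hugging Face transformers library. The checkpoint name and the 15% masking probability are common defaults chosen for illustration, not necessarily the configuration used in the article.

    from transformers import AutoTokenizer, DataCollatorForLanguageModeling

    # Any German BERT checkpoint works here; "bert-base-german-cased" is only an example.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")

    # MLM: a fraction of (sub)word tokens is replaced by [MASK] (or a random/unchanged
    # token) and the model is trained to reconstruct the original tokens.
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15
    )

    encoded = tokenizer(
        ["Johann Wolfgang von Goethe wurde in Frankfurt geboren."],
        return_tensors="pt",
    )
    batch = collator([{k: v[0] for k, v in encoded.items()}])
    print(batch["input_ids"])  # some token ids randomly replaced by the [MASK] id
    print(batch["labels"])     # -100 everywhere except at the masked positions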



Introduction

NER is a well-known task in the field of NLP. The NEISS project [1], in which we work in close cooperation with Germanists, is devoted to automating diverse processes in the creation of digital editions. The best results for NER tasks have been achieved with Transformer-based [2] language models, such as Bidirectional Encoder Representations from Transformers (BERT) [3]. A BERT model is first pre-trained on large amounts of unlabeled text to obtain a robust language model and then fine-tuned for a downstream task. Pre-training is resource-intensive and takes a long time (several weeks). Online platforms, such as Hugging Face [7], offer a zoo of already pre-trained networks that can be used directly to train a downstream task. However, the available models are not always suitable for a certain task, such as NER in German, because they may have been pre-trained on a different domain (e.g., language, time epoch, or text style).
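To make the fine-tuning step concrete, the following sketch loads a publicly available pre-trained German BERT checkpoint from the Hugging Face hub and attaches a token-classification head for NER. The checkpoint name and the label set are placeholders chosen for illustration; they are not the models or datasets evaluated in this article.

    from transformers import AutoTokenizer, AutoModelForTokenClassification

    # Illustrative NER tag set and an example German checkpoint from the hub.
    labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
    tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
    model = AutoModelForTokenClassification.from_pretrained(
        "bert-base-german-cased",
        num_labels=len(labels),
        id2label=dict(enumerate(labels)),
        label2id={label: i for i, label in enumerate(labels)},
    )

    # The pre-trained encoder is reused as-is; the classification head is newly initialised.
    inputs = tokenizer("Theodor Fontane lebte in Neuruppin.", return_tensors="pt")
    logits = model(**inputs).logits  # shape: (1, num_subword_tokens, len(labels))

Fine-tuning then updates the randomly initialised classification head together with the encoder weights on labelled NER data, e.g., via the transformers Trainer API.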
