On the instability of further pre-training: Does a single sentence matter to BERT?

Luca Bacco, Gosse Minnema, Tommaso Caselli, Felice Dell’Orletta, Mario Merone, Malvina Nissim

doi:10.1016/j.nlp.2023.100037

Abstract

We observe a remarkable instability in BERT-like models: minimal changes in the internal representations of BERT, as induced by one-step further pre-training with even a single sentence, can noticeably change the behaviour of subsequently fine-tuned models. Although the pre-trained models seem to be essentially the same, even according to established similarity assessment techniques, the tiny measurable changes appear to substantially affect the models’ tuning path, leading to significantly different fine-tuned systems and affecting downstream performance. After testing a very large number of combinations, which we briefly summarize, the experiments reported in this short paper focus on an intermediate phase consisting of a single-step, single-sentence masked language modeling stage and its impact on a sentiment analysis task. We discuss a series of unexpected findings that leave open questions about the nature and stability of further pre-training.

Full Text