Abstract

Modern pre-trained language models are mostly built upon backbones stacking self-attention and feed-forward layers in an interleaved order. In this paper, beyond this stereotyped layer pattern, we aim to improve pre-trained models by exploiting layer variety from two aspects: the layer type set and the layer order. Specifically, besides the original self-attention and feed-forward layers, we introduce convolution into the layer type set, which we find experimentally beneficial to pre-trained models. Furthermore, beyond the original interleaved order, we explore more layer orders to discover more powerful architectures. However, the introduced layer variety leads to an architecture space of more than billions of candidates, while training a single candidate model from scratch already requires huge computation cost, making it unaffordable to search such a space by directly training large numbers of candidate models. To solve this problem, we first pre-train a supernet from which the weights of all candidate models can be inherited, and then adopt an evolutionary algorithm guided by pre-training accuracy to find the optimal architecture. Extensive experiments show that the LV-BERT models obtained by our method outperform BERT and its variants on various downstream tasks. For example, LV-BERT-small achieves 79.8 on the GLUE test set, 1.8 higher than the strong baseline ELECTRA-small.
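To illustrate the kind of search the abstract describes, the minimal sketch below (not the authors' code) runs an evolutionary search over layer-type sequences. The layer abbreviations, the hyperparameters, and the proxy_accuracy function (a hypothetical stand-in for evaluating a candidate with weights inherited from the pre-trained supernet) are all assumptions for illustration.

import random

# Assumed abbreviations: S = self-attention, F = feed-forward, C = convolution.
LAYER_TYPES = ["S", "F", "C"]
NUM_LAYERS = 24                       # example backbone depth (assumption)
POP_SIZE, GENERATIONS, MUTATION_RATE = 20, 10, 0.1

def proxy_accuracy(arch):
    # Placeholder fitness: in the paper this role is played by the pre-training
    # accuracy of the candidate, evaluated with supernet-inherited weights.
    rng = random.Random("".join(arch))
    return rng.random()

def mutate(arch):
    # Randomly replace the layer type at a small fraction of positions.
    return [random.choice(LAYER_TYPES) if random.random() < MUTATION_RATE else t
            for t in arch]

def crossover(a, b):
    # Single-point crossover of two parent layer sequences.
    cut = random.randrange(1, NUM_LAYERS)
    return a[:cut] + b[cut:]

def evolutionary_search():
    population = [[random.choice(LAYER_TYPES) for _ in range(NUM_LAYERS)]
                  for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        scored = sorted(population, key=proxy_accuracy, reverse=True)
        parents = scored[: POP_SIZE // 2]        # keep the fitter half
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(POP_SIZE - len(parents))]
        population = parents + children
    return max(population, key=proxy_accuracy)

if __name__ == "__main__":
    print("best layer order:", "".join(evolutionary_search()))

In this sketch, only the search loop structure (population, selection by a fitness signal, mutation and crossover over layer sequences) mirrors the description above; everything else is a simplification.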

Highlights

  • In recent years, pre-trained language models, such as the representative BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020), have gained great success in natural language processing tasks (Peters et al., 2018a; Radford et al., 2018; Yang et al., 2019; Clark et al., 2020).

  • Some recent works have unveiled that some self-attention heads in pre-trained models tend to learn local dependencies due to the inherent property of natural language (Kovaleva et al., 2019; Brunner et al., 2020; Jiang et al., 2020), incurring computation redundancy for capturing local information.

  • It has been shown that the sandwich order can bring improvement on the language modeling task, indicating that the layer order contributes to model performance.


Summary

Introduction

Pre-trained language models, such as the representative BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020), have gained great success in natural language processing tasks (Peters et al., 2018a; Radford et al., 2018; Yang et al., 2019; Clark et al., 2020). The backbone architectures of these models mostly adopt a stereotyped layer pattern that interleaves self-attention and feed-forward layers. [Figure: layer types 1 Self-Attention, 2 Feed-Forward, 3 Convolution; layer orders Interleaved and Sandwich; the pattern {12} × 4 yields the BERT/ELECTRA backbone and {23} × 4 yields the DynamicConv backbone.]
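To make the layer-pattern notation in the figure concrete, the short sketch below (an illustration, not the paper's code) expands a numeric pattern into a sublayer sequence, assuming the mapping 1 → self-attention, 2 → feed-forward, 3 → convolution from the figure legend.

def expand(pattern, repeats):
    # Map the numeric layer codes from the figure legend to layer names
    # (this mapping is inferred from the legend, not taken from released code).
    names = {"1": "self-attention", "2": "feed-forward", "3": "convolution"}
    return [names[c] for c in pattern] * repeats

print(expand("12", 4))  # interleaved self-attention/feed-forward: a BERT/ELECTRA-style backbone
print(expand("23", 4))  # interleaved feed-forward/convolution: a DynamicConv-style backbone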
