Abstract
Modern pre-trained language models are mostly built upon backbones that stack self-attention and feed-forward layers in an interleaved order. In this paper, going beyond this stereotyped layer pattern, we aim to improve pre-trained models by exploiting layer variety from two aspects: the layer type set and the layer order. Specifically, besides the original self-attention and feed-forward layers, we introduce convolution into the layer type set, which we find experimentally to be beneficial to pre-trained models. Furthermore, beyond the original interleaved order, we explore additional layer orders to discover more powerful architectures. However, the introduced layer variety leads to an architecture space of billions of candidates, while training a single candidate model from scratch already requires substantial computation, making it unaffordable to search such a space by directly training large numbers of candidate models. To solve this problem, we first pre-train a supernet from which all candidate models can inherit their weights, and then adopt an evolutionary algorithm guided by pre-training accuracy to find the optimal architecture. Extensive experiments show that the LV-BERT models obtained by our method outperform BERT and its variants on various downstream tasks. For example, LV-BERT-small achieves 79.8 on the GLUE test set, 1.8 higher than the strong baseline ELECTRA-small.
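To make the search procedure concrete, the sketch below is a simplified illustration of this kind of pipeline, not the authors' implementation: candidate architectures over the layer type set {self-attention, feed-forward, convolution} are sampled, mutated, and selected by an evolutionary loop whose fitness function stands in for pre-training accuracy measured with weights inherited from the supernet. All names (LAYER_TYPES, evolutionary_search, mutate) and hyperparameters (depth, population size, mutation rate) are hypothetical.

```python
import random

# Layer type set: besides self-attention (SA) and feed-forward (FF),
# convolution (CONV) is added as a third candidate layer type.
LAYER_TYPES = ["SA", "FF", "CONV"]
NUM_LAYERS = 24  # hypothetical depth; a candidate is one layer type per position


def random_architecture():
    """Sample a random layer order from the search space."""
    return [random.choice(LAYER_TYPES) for _ in range(NUM_LAYERS)]


def mutate(arch, prob=0.1):
    """Flip each position to a random layer type with small probability."""
    return [random.choice(LAYER_TYPES) if random.random() < prob else t for t in arch]


def evolutionary_search(evaluate, population_size=50, generations=20, parents=10):
    """Evolve layer orders guided by a fitness score; in the paper's setting the
    score would be the pre-training accuracy of a candidate whose weights are
    inherited from the pre-trained supernet (here, `evaluate` is a placeholder)."""
    population = [random_architecture() for _ in range(population_size)]
    for _ in range(generations):
        top = sorted(population, key=evaluate, reverse=True)[:parents]
        # Next generation: keep the parents and fill up with their mutations.
        population = top + [mutate(random.choice(top))
                            for _ in range(population_size - parents)]
    return max(population, key=evaluate)


if __name__ == "__main__":
    # Toy fitness purely for demonstration: reward using all three layer types.
    toy_score = lambda arch: len(set(arch))
    print(evolutionary_search(toy_score))
```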
Highlights
In recent years, pre-trained language models, such as the representative BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020), have gained great success in natural language processing tasks (Peters et al., 2018a; Radford et al., 2018; Yang et al., 2019; Clark et al., 2020).
Some recent works have unveiled that some self-attention heads in pre-trained models tend to learn local dependencies due to the inherent property of natural language (Kovaleva et al., 2019; Brunner et al., 2020; Jiang et al., 2020), incurring computational redundancy for capturing local information.
It has been shown that the sandwich order can bring improvement on the language modeling task, indicating that the layer order contributes to model performance.
Summary
Pre-trained language models, such as the representative BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020), have gained great success in natural language processing tasks (Peters et al., 2018a; Radford et al., 2018; Yang et al., 2019; Clark et al., 2020). The backbone architectures of these models mostly adopt a stereotyped layer pattern that interleaves self-attention and feed-forward layers.
[Figure: layer types (1 self-attention, 2 feed-forward, 3 convolution) and layer orders (interleaved, sandwich); e.g., the order {1,2} × 4 corresponds to BERT/ELECTRA and {2,3} × 4 to DynamicConv.]
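For illustration, the short snippet below uses the figure's numbering (an assumed notation, not any released code) to expand a repeated layer-order pattern into a full backbone, showing how {1,2} × 4 yields the interleaved self-attention/feed-forward stack of BERT/ELECTRA and {2,3} × 4 yields a DynamicConv-style feed-forward/convolution stack.

```python
# Hypothetical layer-type codes matching the figure:
# 1 = self-attention, 2 = feed-forward, 3 = convolution.
CODE_TO_TYPE = {1: "SA", 2: "FF", 3: "CONV"}


def expand_order(pattern, repeats):
    """Expand a short layer-order pattern into a full backbone layer list."""
    return [CODE_TO_TYPE[c] for c in pattern] * repeats


print(expand_order([1, 2], 4))  # interleaved SA/FF  -> BERT/ELECTRA-style backbone
print(expand_order([2, 3], 4))  # FF/convolution     -> DynamicConv-style backbone
```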