Abstract
Modern pre-trained language models are mostly built upon backbones that stack self-attention and feed-forward layers in an interleaved order. In this paper, going beyond this stereotyped layer pattern, we aim to improve pre-trained models by exploiting layer variety from two aspects: the layer type set and the layer order. Specifically, besides the original self-attention and feed-forward layers, we introduce convolution into the layer type set, which we find experimentally to be beneficial to pre-trained models. Furthermore, beyond the original interleaved order, we explore additional layer orders to discover more powerful architectures. However, the introduced layer variety leads to an architecture space of billions of candidates, while training a single candidate model from scratch already requires substantial computation, making it unaffordable to search such a space by directly training large numbers of candidate models. To solve this problem, we first pre-train a supernet from which all candidate models can inherit their weights, and then adopt an evolutionary algorithm guided by pre-training accuracy to find the optimal architecture. Extensive experiments show that the LV-BERT models obtained by our method outperform BERT and its variants on various downstream tasks. For example, LV-BERT-small achieves 79.8 on the GLUE test set, 1.8 higher than the strong baseline ELECTRA-small.
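To make the search procedure concrete, the sketch below is a simplified illustration of this kind of pipeline, not the authors' implementation: candidate architectures over the layer type set {self-attention, feed-forward, convolution} are sampled, mutated, and selected by an evolutionary loop whose fitness function stands in for pre-training accuracy measured with weights inherited from the supernet. All names (LAYER_TYPES, evolutionary_search, mutate) and hyperparameters (depth, population size, mutation rate) are hypothetical.

```python
import random

# Layer type set: besides self-attention (SA) and feed-forward (FF),
# convolution (CONV) is added as a third candidate layer type.
LAYER_TYPES = ["SA", "FF", "CONV"]
NUM_LAYERS = 24  # hypothetical depth; a candidate is one layer type per position


def random_architecture():
    """Sample a random layer order from the search space."""
    return [random.choice(LAYER_TYPES) for _ in range(NUM_LAYERS)]


def mutate(arch, prob=0.1):
    """Flip each position to a random layer type with small probability."""
    return [random.choice(LAYER_TYPES) if random.random() < prob else t for t in arch]


def evolutionary_search(evaluate, population_size=50, generations=20, parents=10):
    """Evolve layer orders guided by a fitness score; in the paper's setting the
    score would be the pre-training accuracy of a candidate whose weights are
    inherited from the pre-trained supernet (here, `evaluate` is a placeholder)."""
    population = [random_architecture() for _ in range(population_size)]
    for _ in range(generations):
        top = sorted(population, key=evaluate, reverse=True)[:parents]
        # Next generation: keep the parents and fill up with their mutations.
        population = top + [mutate(random.choice(top))
                            for _ in range(population_size - parents)]
    return max(population, key=evaluate)


if __name__ == "__main__":
    # Toy fitness purely for demonstration: reward using all three layer types.
    toy_score = lambda arch: len(set(arch))
    print(evolutionary_search(toy_score))
```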
Highlights
In recent years, pre-trained language models, such as the representative BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020), have gained great success in natural language processing tasks (Peters et al., 2018a; Radford et al., 2018; Yang et al., 2019; Clark et al., 2020).
Some recent works have unveiled that some self-attention heads in pre-trained models tend to learn local dependencies due to the inherent property of natural language (Kovaleva et al., 2019; Brunner et al., 2020; Jiang et al., 2020), incurring computational redundancy for capturing local information.
It has been shown that the sandwich order can bring improvement on the language modeling task, indicating that the layer order contributes to model performance.
Summary
Pre-trained language models, such as the representative BERT (Devlin et al., 2019) and GPT-3 (Brown et al., 2020), have gained great success in natural language processing tasks (Peters et al., 2018a; Radford et al., 2018; Yang et al., 2019; Clark et al., 2020). The backbone architectures of these models mostly adopt a stereotyped layer pattern that interleaves self-attention and feed-forward layers.
[Figure: layer types (1 self-attention, 2 feed-forward, 3 convolution) and layer orders (interleaved, sandwich); e.g., the order {1,2} × 4 corresponds to BERT/ELECTRA and {2,3} × 4 to DynamicConv.]
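For illustration, the short snippet below uses the figure's numbering (an assumed notation, not any released code) to expand a repeated layer-order pattern into a full backbone, showing how {1,2} × 4 yields the interleaved self-attention/feed-forward stack of BERT/ELECTRA and {2,3} × 4 yields a DynamicConv-style feed-forward/convolution stack.

```python
# Hypothetical layer-type codes matching the figure:
# 1 = self-attention, 2 = feed-forward, 3 = convolution.
CODE_TO_TYPE = {1: "SA", 2: "FF", 3: "CONV"}


def expand_order(pattern, repeats):
    """Expand a short layer-order pattern into a full backbone layer list."""
    return [CODE_TO_TYPE[c] for c in pattern] * repeats


print(expand_order([1, 2], 4))  # interleaved SA/FF  -> BERT/ELECTRA-style backbone
print(expand_order([2, 3], 4))  # FF/convolution     -> DynamicConv-style backbone
```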