On the effect of dropping layers of pre-trained transformer models

Hassan Sajjad, Fahim Dalvi, Nadir Durrani, Preslav Nakov

doi:10.1016/j.csl.2022.101429

Abstract

Transformer-based NLP models are trained using hundreds of millions or even billions of parameters, limiting their applicability in computationally constrained environments. While the number of parameters generally correlates with performance, it is not clear whether the entire network is required for a downstream task. Motivated by recent work on pruning and distilling pre-trained models, we explore strategies to drop layers in pre-trained models and observe the effect of pruning on downstream GLUE tasks. We were able to prune BERT, RoBERTa and XLNet models by up to 40% while maintaining up to 98% of their original performance. Additionally, we show that our pruned models are on par with those built using knowledge distillation, both in terms of size and performance. Our experiments yield interesting observations such as: (i) the lower layers are most critical to maintain downstream task performance, (ii) some tasks such as paraphrase detection and sentence similarity are more robust to the dropping of layers, and (iii) models trained using different objective functions exhibit different learning patterns with respect to layer dropping.
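As a rough illustration of what dropping layers can look like, the sketch below removes the top-most layers from a stack of encoder layers and keeps the lower ones, which is consistent with observation (i). This is a sketch under stated assumptions, not the authors' implementation: the abstract does not say which dropping strategies were used, and the function name `drop_top_layers`, the 12-layer stack, and the choice of dropping four layers are all hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def drop_top_layers(layers: Sequence[T], num_to_drop: int) -> List[T]:
    """Return the layers left after removing the last `num_to_drop` ones.

    Hypothetical helper: layers are assumed to be ordered from the
    lowest (closest to the input) to the highest (closest to the output).
    """
    if not 0 <= num_to_drop < len(layers):
        raise ValueError("num_to_drop must be in [0, len(layers))")
    return list(layers[: len(layers) - num_to_drop])


# Assumed 12-layer encoder stack, represented here by layer names only.
encoder_layers = [f"layer_{i}" for i in range(12)]

# Keep the lower 8 layers and discard the top 4 (an arbitrary choice).
kept = drop_top_layers(encoder_layers, num_to_drop=4)
print(len(kept), kept[0], kept[-1])
```

The same slicing idea would apply to any ordered container of layers; with a real model, the retained layers would replace the original stack before the model is evaluated on a downstream task.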

Journal: Computer Speech & Language
Publication Date: Jul 22, 2022
Citations: 26

Similar Papers

Optimizing Large Language Models on Multi-Core CPUs: A Case Study of the BERT Model
Lanxin Zhao ... Jianbin Fang
Applied Sciences | VOL. 14
11 Mar 2024

Knowledge distillation and data augmentation for NLP light pre-trained models
Hanwen Luo ... Yuqing Zhang
Journal of Physics: Conference Series | VOL. 1651
01 Nov 2020

Analysis of representation and generalization capabilities of pre-trained audio models in urban environments
Daniele Atzeni ... Ester Vidaña-Vila
INTER-NOISE and NOISE-CON Congress and Conference Proceedings | VOL. 270
04 Oct 2024

Point-PEFT: Parameter-Efficient Fine-Tuning for 3D Pre-trained Models
Yiwen Tang ... Bin Zhao
Proceedings of the AAAI Conference on Artificial Intelligence | VOL. 38
24 Mar 2024
