Prune Once for All: Sparse Pre-Trained Language Models

Ofir Zafrir, Haihao Shen, Ariel Larey, Guy Boudoukh, Moshe Wasserblat

doi:10.5281/zenodo.6967409

Abstract

Transformer-based language models are used in a wide range of natural language processing applications. However, they are computationally inefficient and difficult to deploy. In recent years, many compression algorithms have been proposed to make large Transformer-based models more efficient on target hardware. In this work we present a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation. These sparse pre-trained models can be used for transfer learning on a wide range of tasks while maintaining their sparsity pattern. We demonstrate our method on three known architectures, creating sparse pre-trained BERT-Base, BERT-Large, and DistilBERT. We show how the compressed sparse pre-trained models transfer their knowledge to five different downstream natural language tasks with minimal accuracy loss. Moreover, we show how to further compress the sparse models' weights to 8-bit precision using quantization-aware training. For example, with our sparse pre-trained BERT-Large fine-tuned on SQuADv1.1 and quantized to 8-bit, we achieve a compression ratio of $40\times$ for the encoder with less than $1\%$ accuracy loss. To the best of our knowledge, our results show the best compression-to-accuracy ratio for BERT-Base, BERT-Large, and DistilBERT.
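As a rough illustration of how weight pruning and 8-bit quantization can combine into a 40-fold compression ratio, the sketch below computes an idealized ratio from a sparsity level and two bit widths. The function name `compression_ratio`, the 32-bit baseline, and the 90% sparsity value are assumptions made for this example. The abstract does not state the sparsity level, and the calculation ignores the cost of storing the positions of the nonzero weights, so it should be read as back-of-the-envelope arithmetic rather than the authors' exact accounting.

```python
def compression_ratio(sparsity: float, orig_bits: int = 32, quant_bits: int = 8) -> float:
    """Idealized ratio of dense model size to sparse, quantized model size.

    Assumption: only nonzero weights are stored, and the overhead of
    recording their positions is ignored, so real ratios will be lower.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    kept_fraction = 1.0 - sparsity          # share of weights that survive pruning
    bit_reduction = orig_bits / quant_bits  # e.g. 32-bit floats -> 8-bit integers
    return bit_reduction / kept_fraction


# Illustrative value: 90% sparsity is an assumption, not taken from the abstract.
ratio = compression_ratio(sparsity=0.90)
print(f"{ratio:.0f}x")
```

Under these assumptions the two effects multiply: pruning 90% of the weights contributes a factor of 10, and moving from 32-bit to 8-bit storage contributes a factor of 4.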


More From: arXiv (Cornell University)


Similar Papers

A Study of Vietnamese Sentiment Classification with Ensemble Pre-Trained Language Models
Dang Van Thin ... Duong Ngoc Hao
Vietnam Journal of Computer Science | VOL. 11
07 Dec 2023

AMMU: A survey of transformer-based biomedical pretrained language models
Katikapalli Subramanyam Kalyan ... Sivanesan Sangeetha
Journal of Biomedical Informatics | VOL. 126
31 Dec 2021

Application of Transformer-Based Language Models to Detect Hate Speech in Social Media
Swapnanil Mukherjee ... Sujit Das
Journal of Computational and Cognitive Engineering | VOL. 2
17 Dec 2021

Transformer-based deep neural network language models for Alzheimer’s disease risk assessment from targeted speech
Alireza Roshanzamir ... Hamid Aghajan
BMC Medical Informatics and Decision Making | VOL. 21
09 Mar 2021
