Blockwise compression of transformer-based models without retraining

Gaochen Dong, W Chen

doi:10.1016/j.neunet.2023.12.001

Abstract

Transformer-based models, exemplified by GPT-3, ChatGPT, and GPT-4, have recently garnered considerable attention in both academia and industry due to their promising performance in general language tasks. Nevertheless, these models typically involve computationally intensive encoding processes, and in some cases decoding processes as well, both of which are fundamentally large-scale matrix multiplications. These operations bring the inevitable challenges of massive computational resources and a huge memory footprint, usually requiring at least 10^23 FLOPs and hundreds of gigabytes, respectively. A common way to address this issue is to reduce the computational and memory requirements by applying layerwise quantization to the transformer, replacing the usual fp32 data type with a low-bit equivalent. Unfortunately, this method often decreases model accuracy and necessitates time-consuming retraining. Such retraining requires not only fine-tuning skills but also substantial computational resources, posing challenges for users. To tackle these issues, we propose BCT, a framework of blockwise compression for transformers without retraining, aiming to facilitate model deployment. Unlike layerwise compression methods, BCT achieves finer compression of the entire transformer by operating blockwise. This mitigates the data distribution deviation caused by quantization, eliminating the requirement for retraining. BCT compresses all components of the model, including but not limited to the embedding, matrix multiplication, GELU, Softmax, layer normalization, and intermediate results. In a case study, an efficient model is compressed by BCT, achieving up to 7.988x compression. We also evaluate it on several General Language Understanding Evaluation (GLUE) datasets. Experimental results on the majority of the GLUE benchmark datasets demonstrate the effectiveness of our method: BCT incurs less than 0.9% degradation in accuracy, compared with more than 1% degradation for other methods that provide similar or inferior compression ratios.
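The contrast the abstract draws between layerwise and blockwise quantization can be illustrated with a small sketch. This is not the BCT algorithm itself, whose details are not given here; it is a generic symmetric uniform quantizer, and the function names, the 8-bit width, the block size of 64, and the synthetic weights are all assumptions chosen for illustration. It shows only why giving each block its own scale tends to reduce quantization error when a single outlier would otherwise stretch one shared scale.

```python
import random


def quantize_dequantize(values, bits=8):
    """Symmetric uniform quantization with ONE scale shared by all values.

    Returns the dequantized values so the rounding error can be measured.
    (Hypothetical helper, not taken from the paper.)
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / qmax
    out = []
    for v in values:
        q = round(v / scale)
        q = max(-qmax, min(qmax, q))  # clamp to the representable integer range
        out.append(q * scale)
    return out


def blockwise_quantize_dequantize(values, block_size, bits=8):
    """Quantize each contiguous block with its own scale (assumed block layout)."""
    out = []
    for start in range(0, len(values), block_size):
        out.extend(quantize_dequantize(values[start:start + block_size], bits))
    return out


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Synthetic "weights": small values everywhere, plus one large outlier
# confined to the first block.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(1024)]
weights[0] = 4.0

per_tensor = quantize_dequantize(weights, bits=8)
per_block = blockwise_quantize_dequantize(weights, block_size=64, bits=8)

mse_tensor = mean_squared_error(weights, per_tensor)
mse_block = mean_squared_error(weights, per_block)
print(f"shared-scale MSE: {mse_tensor:.3e}")
print(f"blockwise MSE:    {mse_block:.3e}")
```

With a shared scale, the outlier sets the step size for every value, so most of the small weights are rounded coarsely. With per-block scales, only the block containing the outlier pays that cost, which is one plausible reading of how blockwise operation limits the distribution deviation mentioned above.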


More From: Neural Networks


Similar Papers

Topic-Controlled Text Generation
Cansen Caglayan ... Murat Karakaya
15 Sep 2021

Retrieval-Augmented Transformer-XL for Close-Domain Dialog Generation
Giovanni Bonetta ... Paul Vozila
The International FLAIRS Conference Proceedings | VOL. 34
18 Apr 2021

Language Acquisition in Language Models and Humans: A Chronological Probing Study
Ekaterina Voloshina ... Oleg Serikov
18 Jun 2022

A New Human Factor Study in Developing Practical Vision-Based Applications with the Transformer-Based Deep Learning Model
Thitirat Siriborvornratanakul
01 Jan 2021
