Fluctuation-Based Adaptive Structured Pruning for Large Language Models

Yongqi An, Jinqiao Wang, Tao Yu, Ming Tang, Xu Zhao

doi:10.1609/aaai.v38i10.28960

Abstract

Network pruning is a promising way to address the huge computing resource demands of deploying and running inference with Large Language Models (LLMs). Being retraining-free is important for LLM pruning methods. However, almost all existing retraining-free pruning approaches for LLMs focus on unstructured pruning, which requires specific hardware support for acceleration. In this paper, we propose a novel retraining-free structured pruning framework for LLMs, named FLAP (FLuctuation-based Adaptive Structured Pruning). It is hardware-friendly, effectively reducing storage and enhancing inference speed. For effective structured pruning of LLMs, we highlight three critical elements that demand the utmost attention: formulating structured importance metrics, adaptively searching the global compressed model, and implementing compensation mechanisms to mitigate performance loss. First, FLAP uses a fluctuation-based pruning metric to determine whether the output feature map is easily recoverable when a column of weights is removed. Then it standardizes the importance scores to adaptively determine the global compressed model structure. Finally, FLAP adds bias terms to recover the output feature maps using the baseline values. We thoroughly evaluate our approach on a variety of language benchmarks. Without any retraining, our method significantly outperforms state-of-the-art methods, including LLM-Pruner and the extension of Wanda to structured pruning. The code is released at https://github.com/CASIA-IVA-Lab/FLAP.
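
The three steps named in the abstract (fluctuation metric, score standardization, bias compensation) can be illustrated with a minimal sketch for a single linear layer. This is not the released FLAP code. The abstract does not give the formulas, so the sketch assumes that "fluctuation" is the sample variance of each input channel over a calibration set, weighted by the squared norm of the matching weight column, and that the "baseline value" is the channel mean. All function names here are hypothetical.

```python
from statistics import mean, pstdev, variance


def fluctuation_scores(X, W):
    """Score each input channel (column of W) on calibration samples X.

    Assumption: score = sample variance of the channel * squared column norm.
    A low score means the channel stays close to its mean, so a constant
    bias can stand in for it.
    """
    scores = []
    for j in range(len(W[0])):
        channel = [x[j] for x in X]
        col_norm_sq = sum(row[j] ** 2 for row in W)
        scores.append(variance(channel) * col_norm_sq)
    return scores


def standardize(scores):
    """Z-score one layer's scores so layers can be compared on one scale."""
    mu, sigma = mean(scores), pstdev(scores)
    return [(s - mu) / sigma if sigma > 0 else 0.0 for s in scores]


def prune_with_bias(X, W, b, keep):
    """Drop columns not in `keep`; fold their mean contribution into the bias."""
    n_in = len(W[0])
    kept = sorted(keep)
    baseline = [mean([x[j] for x in X]) for j in range(n_in)]  # assumed baseline
    dropped = [j for j in range(n_in) if j not in keep]
    new_b = [bi + sum(row[j] * baseline[j] for j in dropped)
             for row, bi in zip(W, b)]
    new_W = [[row[j] for j in kept] for row in W]
    return new_W, new_b


def forward(W, b, x):
    """Plain linear layer: y = W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


# Toy calibration data: channel 1 is constant, so it has zero fluctuation.
X = [[1.0, 5.0, 2.0], [3.0, 5.0, 2.1], [2.0, 5.0, 1.9]]
W = [[1.0, 2.0, 0.5], [0.0, -1.0, 1.0]]
b = [0.0, 0.0]

scores = fluctuation_scores(X, W)
z = standardize(scores)
lowest = scores.index(min(scores))
keep = {j for j in range(len(scores)) if j != lowest}
W_pruned, b_pruned = prune_with_bias(X, W, b, keep)
```

In this toy case the constant channel gets the lowest score, and the pruned layer with its adjusted bias reproduces the original outputs on the calibration samples. With real activations the recovery is only approximate, which is why channels with low fluctuation are the ones to remove first. In a global search, the standardized scores from every layer would be pooled and thresholded together; that step is omitted here.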

More From: Proceedings of the AAAI Conference on Artificial Intelligence