Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation

Zhewei Yao, Yuxiong He, Cheng Li, Stephen Youn, Xiaoxia Wu

doi:10.1609/aaai.v38i17.29908

Abstract

Post-training quantization (PTQ) has emerged as a promising technique for mitigating memory consumption and computational costs in large language models (LLMs). However, a systematic examination of various quantization schemes, model families, and quantization bit precision has been absent from the literature. In this paper, we conduct a comprehensive analysis of these factors by investigating the effects of PTQ on weight-only, activation-only, and weight-and-activation quantization using diverse methods such as round-to-nearest (RTN), GPTQ, ZeroQuant, and their variants. We apply these methods to two distinct model families with parameters ranging from 125M to 176B. Our contributions include: (1) a sensitivity analysis revealing that activation quantization is generally more susceptible to accuracy degradation than weight quantization, with smaller models often outperforming larger models in terms of activation quantization; (2) an evaluation and comparison of existing PTQ methods to optimize model size reduction while minimizing the impact on accuracy, revealing that none of the current methods can achieve the original model quality for quantization with either INT4-weight or INT4-weight-and-INT8-activation; (3) based on these insights, we propose an optimized method called Low-Rank Compensation (LoRC), which employs low-rank matrices to enhance model quality recovery with a minimal increase in model size.
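
The idea behind contribution (3) can be illustrated with a small sketch. The abstract does not spell out the procedure, so the following is only one plausible reading, under stated assumptions: weights are quantized with symmetric per-row RTN to INT4, and the quantization error is approximated by a truncated SVD whose two factors are stored alongside the quantized weights. The function names `rtn_quantize` and `low_rank_compensation`, the matrix shape, and the rank are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def rtn_quantize(weight, bits=4):
    """Symmetric per-row round-to-nearest (RTN) quantize-dequantize (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4
    # One scale per output row; guard against all-zero rows.
    scale = np.abs(weight).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(weight / scale), -qmax - 1, qmax)
    return q * scale  # dequantized weights, same shape as the input


def low_rank_compensation(weight, weight_q, rank):
    """Approximate the error E = W - W_q by a rank-`rank` product U @ V."""
    error = weight - weight_q
    u, s, vt = np.linalg.svd(error, full_matrices=False)
    sqrt_s = np.sqrt(s[:rank])
    u_hat = u[:, :rank] * sqrt_s         # shape (out_features, rank)
    v_hat = sqrt_s[:, None] * vt[:rank]  # shape (rank, in_features)
    return u_hat, v_hat


# Toy weight matrix standing in for one linear layer (hypothetical sizes).
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
W_q = rtn_quantize(W, bits=4)
U, V = low_rank_compensation(W, W_q, rank=8)

# Frobenius-norm error without and with the low-rank correction.
err_before = np.linalg.norm(W - W_q)
err_after = np.linalg.norm(W - (W_q + U @ V))
print(f"error before: {err_before:.4f}, after: {err_after:.4f}")
```

For a layer with d_out x d_in weights, the two factors add rank * (d_out + d_in) values, which stays small relative to d_out * d_in as long as the rank is much smaller than either dimension.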

More From: Proceedings of the AAAI Conference on Artificial Intelligence
