LRQuant+: A Unified and Learnable Framework for Post-Training Quantization for Transformer-Based Large Foundation Models

Abstract

Post-training quantization (PTQ) for transformer-based large foundation models (LFMs) significantly accelerates model inference and relieves memory constraints, without incurring model retraining. However, existing methods face three main issues: 1) the scaling factors commonly used in scale-reparameterization-based weight-activation quantization to mitigate quantization errors are mostly hand-crafted, which may lead to suboptimal results; 2) the current formulation of quantization error, defined by the L2-norm, ignores directional shifts after quantization; 3) most methods are tailored to a single scenario, i.e., evaluated only on LLMs or designed only for weight-only quantization, and thus lack a comprehensive evaluation on diverse benchmarks and a broad application scope. To address these challenges, this paper introduces a unified Learnable and Robust post-training Quantization framework for transformer-based LFMs and various quantization scenarios, called LRQuant. First, we consider an efficient block-wise learnable paradigm that finds optimal scaling factors, initialized by logarithmic activation equivalence, and obtains suitable clipping ranges for the quantization steps. In addition, we empirically find that relying only on an MSE loss can hardly lead to optimal quantization results, so we reformulate the quantization error and propose a novel loss function based on the negative logarithm of cosine similarity (NLC loss) between the outputs of the full-precision and quantized blocks. To fully exploit the potential of our learnable paradigm, we propose an enhanced version, LRQuant+. Specifically, we first propose a dynamically weighted scheme to balance the MSE and NLC losses, and then devise learnable rotation vectors to further directly reduce directional gaps. In addition, we improve the block-wise optimization framework into a novel two-branch design that jointly considers the error propagation and the homologous reconstruction error.
Extensive experiments demonstrate the superiority of our LRQuant and LRQuant+, as well as their unified effectiveness across various LFMs for both weight-activation and weight-only quantization, especially under challenging quantization scenarios, i.e., W4A4 and W2A16, on LLMs, ViTs, and MLLMs.
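The NLC loss described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, and the fixed weight `alpha` stands in for LRQuant+'s dynamically weighted scheme, which is not specified here.

```python
import numpy as np

def nlc_loss(fp_out, q_out, eps=1e-8):
    """Negative logarithm of the cosine similarity between the outputs of a
    full-precision block and its quantized counterpart (per sample)."""
    fp = fp_out.reshape(fp_out.shape[0], -1)
    q = q_out.reshape(q_out.shape[0], -1)
    cos = np.sum(fp * q, axis=1) / (
        np.linalg.norm(fp, axis=1) * np.linalg.norm(q, axis=1) + eps)
    # Clip so the log stays defined when outputs point in opposite directions.
    return float(np.mean(-np.log(np.clip(cos, eps, 1.0))))

def combined_loss(fp_out, q_out, alpha=0.5):
    """Hypothetical fixed-weight combination of the MSE and NLC terms."""
    mse = float(np.mean((fp_out - q_out) ** 2))
    return alpha * mse + (1.0 - alpha) * nlc_loss(fp_out, q_out)
```

Note that a pure rescaling of the block output leaves the NLC term near zero while the MSE term grows, which is exactly the magnitude-versus-direction distinction the abstract draws.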

Similar Papers
  • Conference Article
  • Citations: 5
  • 10.1109/itsc45102.2020.9294350
Scalar and Vector Quantization for Learned Image Compression: A Study on the Effects of MSE and GAN Loss in Various Spaces
  • Sep 20, 2020
  • Jonas Lohdefink + 3 more

Recently, learned image compression by means of neural networks has experienced a performance boost by the use of adversarial loss functions. Typically, a generative adversarial network (GAN) is designed with the generator being an autoencoder with quantizer in the bottleneck for compression and reconstruction. It is well known from rate-distortion theory that vector quantizers provide lower quantization errors than scalar quantizers at the same bitrate. Still, learned image compression approaches often use scalar quantization instead. In this work we provide insights into the image reconstruction quality of the often-employed uniform scalar quantizers, non-uniform scalar quantizers, and the rarely employed but bitrate-efficient vector quantizers, all being integrated into backpropagation and operating under the exact same bitrate. Further interesting insights are obtained by our investigation of an MSE loss and a GAN loss. We show that vector quantization is always beneficial for the compression performance both in the latent space and the reconstructed image space. However, image samples demonstrate that the GAN loss produces the more pleasing reconstructed images, while the non-adversarial MSE loss provides better quality scores of various instrumental measures both in the latent space and on the reconstructed images.
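As a rough illustration of the scalar-quantization error the study measures, the MSE of a uniform scalar quantizer at a few bit-widths can be computed directly. This is a sketch only, not the paper's setup of quantizers integrated into backpropagation at a fixed bitrate.

```python
import numpy as np

def uniform_scalar_quantize(x, n_bits):
    """Uniform scalar quantization over the empirical range of x."""
    lo, hi = float(x.min()), float(x.max())
    step = (hi - lo) / (2 ** n_bits - 1)
    return np.round((x - lo) / step) * step + lo

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
# Quantization error falls as the bit-width (and hence the bitrate) rises.
mse = {b: float(np.mean((x - uniform_scalar_quantize(x, b)) ** 2))
       for b in (2, 4, 8)}
```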

  • Research Article
  • Citations: 1
  • 10.1109/tpami.2025.3528042
Towards Accurate Post-Training Quantization of Vision Transformers via Error Reduction.
  • Apr 1, 2025
  • IEEE transactions on pattern analysis and machine intelligence
  • Yunshan Zhong + 4 more

Post-training quantization (PTQ) for vision transformers (ViTs) has received increasing attention from both academic and industrial communities due to its minimal data needs and high time efficiency. However, many current methods fail to account for the complex interactions between quantized weights and activations, resulting in significant quantization errors and suboptimal performance. This paper presents ERQ, an innovative two-step PTQ method specifically crafted to reduce quantization errors arising from activation and weight quantization sequentially. The first step, Activation quantization error reduction (Aqer), first applies Reparameterization Initialization aimed at mitigating initial quantization errors in high-variance activations. Then, it further mitigates the errors by formulating a Ridge Regression problem, which updates the weights maintained at full-precision using a closed-form solution. The second step, Weight quantization error reduction (Wqer), first applies Dual Uniform Quantization to handle weights with numerous outliers, which arise from adjustments made during Reparameterization Initialization, thereby reducing initial weight quantization errors. Then, it employs an iterative approach to further tackle the errors. In each iteration, it adopts Rounding Refinement that uses an empirically derived, efficient proxy to refine the rounding directions of quantized weights, complemented by a Ridge Regression solver to reduce the errors. Comprehensive experimental results demonstrate ERQ's superior performance across various ViTs variants and tasks. For example, ERQ surpasses the state-of-the-art GPTQ by a notable 36.81% in accuracy for W3A4 ViT-S.
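The Ridge Regression step described above admits a standard closed-form solution. The sketch below uses assumed shapes and a hypothetical `ridge_compensate` helper rather than ERQ's exact formulation: it adjusts weights kept at full precision so that the layer output compensates for the error introduced by quantization.

```python
import numpy as np

def ridge_compensate(X, W_fp, W_q, lam=1e-2):
    """Closed-form ridge solve for a weight correction dW:
    min_dW ||X @ (W_q + dW) - X @ W_fp||^2 + lam * ||dW||^2,
    where X are calibration inputs, W_fp the full-precision weights,
    and W_q the quantized weights."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)       # positive definite, so solvable
    dW = np.linalg.solve(A, X.T @ X @ (W_fp - W_q))
    return W_q + dW
```

The ridge term keeps the correction small, so the compensated weights stay close to their quantized starting point while shrinking the output error.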

  • Conference Article
  • Citations: 4
  • 10.1109/cac53003.2021.9728246
Fixed-point Quantization for Vision Transformer
  • Oct 22, 2021
  • Zhexin Li + 3 more

Recently, transformer-based models have shown promising results on miscellaneous computer vision tasks. However, their high computation cost makes them neither practical to deploy on mobile devices, nor economical to compute on servers. In this paper, we propose two effective quantization schemes for reducing the memory usage and computation consumption of vision transformers. First, we develop an approximation-based Post-training Quantization (PTQ) approach which optimizes for a set of quantization scaling factors that minimize quantization errors. Moreover, we introduce a learning-based Quantization-aware Training (QAT) approach that allows for model finetuning after inserting quantization operations to restore accuracy. Furthermore, we reveal the complementary effects of the learning-based and approximation-based approaches in QAT and propose an effective strategy for the initialization of quantization parameters. We evaluate our approaches on ImageNet for different vision transformer models. Our quantization algorithms outperform the previous state-of-the-art approaches on both post-training quantization and quantization-aware training benchmarks. With weights and activations in vision transformers quantized to 8-bit integers, we obtain a ×4 compression rate of model parameters with an accuracy drop of less than 0.2% for models of various scales.

  • Conference Article
  • Citations: 12
  • 10.1109/cvprw53098.2021.00277
Do All MobileNets Quantize Poorly? Gaining Insights into the Effect of Quantization on Depthwise Separable Convolutional Networks Through the Eyes of Multi-scale Distributional Dynamics
  • Jun 1, 2021
  • Stone Yun + 1 more

As the “Mobile AI” revolution continues to grow, so does the need to understand the behaviour of edge-deployed deep neural networks. In particular, MobileNets [9], [22] are the go-to family of deep convolutional neural networks (CNN) for mobile. However, they often have significant accuracy degradation under post-training quantization. While studies have introduced quantization-aware training and other methods to tackle this challenge, there is limited understanding of why MobileNets (and potentially depthwise-separable CNNs (DWSCNN) in general) quantize so poorly compared to other CNN architectures. Motivated to gain deeper insights into this phenomenon, we take a different strategy and study the multi-scale distributional dynamics of MobileNet-V1, a set of smaller DWSCNNs, and regular CNNs. Specifically, we investigate the impact of quantization on the weight and activation distributional dynamics as information propagates from layer to layer, as well as overall changes in distributional dynamics at the network level. This fine-grained analysis revealed significant dynamic range fluctuations and a “distributional mismatch” between channelwise and layerwise distributions in DWSCNNs that lead to increasing quantized degradation and distributional shift during information propagation. Furthermore, analysis of the activation quantization errors shows that there is greater quantization error accumulation in DWSCNNs compared to regular CNNs. The hope is that such insights can lead to innovative strategies for reducing such distributional dynamics changes and improve post-training quantization for mobile.

  • Book Chapter
  • Citations: 1
  • 10.1007/978-3-031-20083-0_13
Symmetry Regularization and Saturating Nonlinearity for Robust Quantization
  • Jan 1, 2022
  • Sein Park + 2 more

Robust quantization improves the tolerance of networks for various implementations, allowing reliable output in different bit-widths or fragmented low-precision arithmetic. In this work, we perform extensive analyses to identify the sources of quantization error and present three insights to robustify a network against quantization: reduction of error propagation, range clamping for error minimization, and inherited robustness against quantization. Based on these insights, we propose two novel methods called symmetry regularization (SymReg) and saturating nonlinearity (SatNL). Applying the proposed methods during training can enhance the robustness of arbitrary neural networks against quantization on existing post-training quantization (PTQ) and quantization-aware training (QAT) algorithms and enables us to obtain a single weight flexible enough to maintain the output quality under various conditions. We conduct extensive studies on CIFAR and ImageNet datasets and validate the effectiveness of the proposed methods. Keywords: Robust quantization; Post-training quantization (PTQ); Quantization-aware training (QAT)

  • Conference Article
  • Citations: 1
  • 10.1109/icccs55155.2022.9846198
Post-Training Quantization for Longformer with Chunkwise Quantization Granularity and Optimized Percentile
  • Apr 22, 2022
  • Qibin Chen + 7 more

Transformer-based models have proven successful in many natural language processing and computer vision tasks. Because of their computational complexity, many efficient transformer variants have been proposed, including the Longformer, which aims at long document processing. In this paper, we present an effective post-training quantization scheme for Longformer. Based on sliding window attention in Longformer, we propose chunkwise quantization. It can decrease quantization noise caused by significant gaps between the ranges of different windows. Besides, to reduce quantization noise caused by clipping, we optimize the percentile value by minimizing the mean squared error between the original and quantized matrices. The quantization scheme is evaluated on the TriviaQA task, and the performance is comparable to the float32 model. In addition, the quantization scheme can be extended to other efficient transformer-based models.
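The percentile optimization mentioned above can be sketched as a small grid search over candidate clipping percentiles, scoring each by the MSE between the original and quantized tensor. The function names and the candidate set are illustrative, not the paper's procedure.

```python
import numpy as np

def clip_quantize(x, clip, n_bits=8):
    """Symmetric uniform quantization after clipping x to [-clip, clip]."""
    step = 2 * clip / (2 ** n_bits - 1)
    return np.round(np.clip(x, -clip, clip) / step) * step

def best_percentile(x, n_bits, candidates=(99.0, 99.9, 99.99, 100.0)):
    """Pick the clipping percentile of |x| minimizing quantization MSE."""
    scores = {p: float(np.mean((x - clip_quantize(
                  x, np.percentile(np.abs(x), p), n_bits)) ** 2))
              for p in candidates}
    return min(scores, key=scores.get), scores
```

On a tensor with a few large outliers and a low bit-width, clipping below the 100th percentile typically wins: the clipping error on the outliers is outweighed by the finer resolution gained for the bulk of the values.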

  • Research Article
  • Citations: 1
  • 10.3390/fi17040185
Mitigating Quantization Errors Due to Activation Spikes in Gated Linear Unit-Based Large Language Models
  • Apr 21, 2025
  • Future Internet
  • Jaewoo Yang + 3 more

Modern large language models (LLMs) achieve state-of-the-art performance through architectural advancements but require high computational costs for inference. Post-training quantization is a widely adopted approach to reduce these costs by quantizing weights and activations to lower precision, such as INT8. However, we identify a critical challenge in activation quantization for GLU (Gated Linear Unit) variants, which are commonly used in the feed-forward networks of modern LLMs like the LLaMA family. Specifically, severe local quantization errors arise due to excessively large activation magnitudes, which we refer to as activation spikes, leading to significant degradation in model performance. Our analysis reveals a systematic pattern of these spikes: they predominantly occur in the FFN (feed-forward network) layers at the early and late layers of the model and are concentrated on a small subset of tokens rather than being uniformly distributed across a token sequence. To mitigate this issue, we propose two empirical methods: Quantization-free Module (QFeM) and Quantization-free Prefix (QFeP), which isolate activation spikes during quantization. Extensive experiments demonstrated that our methods effectively improve activation quantization, particularly in coarse-grained quantization schemes, enhancing the performance of LLMs with GLU variants and addressing the limitations of existing quantization techniques. The code for implementing our methods and reproducing the experiments is publicly available in our GitHub repository.

  • Conference Article
  • Citations: 2
  • 10.1109/ijcnn55064.2022.9892902
CoNLoCNN: Exploiting Correlation and Non-Uniform Quantization for Energy-Efficient Low-precision Deep Convolutional Neural Networks
  • Jul 18, 2022
  • Muhammad Abdullah Hanif + 5 more

In today's era of smart cyber-physical systems, Deep Neural Networks (DNNs) have become ubiquitous due to their state-of-the-art performance in complex real-world applications. The high computational complexity of these networks, which translates to increased energy consumption, is the foremost obstacle towards deploying large DNNs in resource-constrained systems. Fixed-Point (FP) implementations achieved through post-training quantization are commonly used to curtail the energy consumption of these networks. However, the uniform quantization intervals in FP restrict the bit-width of data structures to large values due to the need to represent most of the numbers with sufficient resolution and avoid high quantization errors. In this paper, we leverage the key insight that (in most of the scenarios) DNN weights and activations are mostly concentrated near zero and only a few of them have large magnitudes. We propose CoNLoCNN, a framework to enable energy-efficient low-precision deep convolutional neural network inference by exploiting: (1) non-uniform quantization of weights enabling simplification of complex multiplication operations; and (2) correlation between activation values enabling partial compensation of quantization errors at low cost without any run-time overheads. To significantly benefit from non-uniform quantization, we also propose a novel data representation format, Encoded Low-Precision Binary Signed Digit, to compress the bit-width of weights while ensuring direct use of the encoded weight for processing using a novel multiply-and-accumulate (MAC) unit design.

  • Research Article
  • 10.1609/aaai.v39i16.33807
OAC: Output-adaptive Calibration for Accurate Post-training Quantization
  • Apr 11, 2025
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Ali Edalati + 6 more

Deployment of Large Language Models (LLMs) incurs major computational costs due to their rapidly expanding size. Compression of LLMs reduces the memory footprint, latency, and energy required for their inference. Post-training Quantization (PTQ) techniques have been developed to compress LLMs while avoiding expensive re-training. Most PTQ approaches formulate the quantization error based on a layer-wise Euclidean loss, ignoring the model output. Then, each layer is calibrated using its layer-wise Hessian to update the weights towards minimizing the quantization error. The Hessian is also used for detecting the weights most salient for quantization. Such PTQ approaches are prone to accuracy drops in low-precision quantization. We propose Output-adaptive Calibration (OAC) to incorporate the model output in the calibration process. We formulate the quantization error based on the distortion of the output cross-entropy loss. OAC approximates the output-adaptive Hessian for each layer under reasonable assumptions to reduce the computational complexity. The output-adaptive Hessians are used to update the weight matrices and detect the salient weights towards maintaining the model output. Our proposed method outperforms state-of-the-art baselines such as SpQR and BiLLM, especially at extremely low-precision (2-bit and binary) quantization.

  • Research Article
  • Citations: 3
  • 10.3390/rs16214042
Hierarchical Mixed-Precision Post-Training Quantization for SAR Ship Detection Networks
  • Oct 30, 2024
  • Remote Sensing
  • Hang Wei + 2 more

Convolutional neural network (CNN)-based synthetic aperture radar (SAR) ship detection models operating directly on satellites can reduce transmission latency and improve real-time surveillance capabilities. However, limited satellite platform resources present a significant challenge. Post-training quantization (PTQ) provides an efficient way for pre-trained neural networks to effectively reduce memory and computational resources without retraining. Despite this, PTQ faces the challenge of maintaining model accuracy, especially at low-bit quantization (e.g., 4-bit or 2-bit). To address this challenge, we propose a hierarchical mixed-precision post-training quantization (HMPTQ) method for SAR ship detection neural networks to reduce quantization error. This method encompasses a layerwise precision configuration based on reconstruction error and an intra-layer mixed-precision quantization strategy. Specifically, our approach initially utilizes the activation reconstruction error of each layer to gauge the sensitivity necessary for bit allocation, considering the interdependencies among layers, which effectively reduces the complexity of computational sensitivity and achieves more precise quantization allocation. Subsequently, to minimize the quantization error of the layers, an intra-layer mixed-precision quantization strategy based on probability density assigns a greater number of quantization bits to regions where the probability density is low for higher values. Our evaluation on the SSDD, HRSID, and LS-SSDD-v1.0 SAR ship datasets, using different detection CNN models, shows that the YOLOV9c model with mixed-precision quantization at 4-bit and 2-bit for weights and activations achieves only a 0.28% accuracy loss on the SSDD dataset, while reducing the model size by approximately 80%. Compared to state-of-the-art methods, our approach maintains competitive accuracy, confirming the superior performance of the HMPTQ method over existing quantization techniques.

  • Book Chapter
  • 10.1007/978-981-99-1642-9_7
Adaptive Rounding Compensation for Post-training Quantization
  • Jan 1, 2023
  • Jinhui Lin + 7 more

Network quantization can compress and accelerate deep neural networks by reducing the bit-width of network parameters so that the quantized networks can be deployed to resource-limited devices. Post-Training Quantization (PTQ) is a practical method of generating a hardware-friendly quantized network without re-training or fine-tuning. However, PTQ results in unacceptable accuracy degradation due to disturbance caused by clipping and discarding the rounded remains. To address this problem, we propose Adaptive Rounding Compensation Quantization (ARCQ) to reduce the quantization errors by utilizing the rounded remains and clipping threshold that can be computed in resource-limited devices. Moreover, to leverage accuracy and speed, we propose a dynamic compensation method to select critical layers to be compensated in terms of parameters and quantization errors. Extensive experiments verify that our method can achieve superior results on ImageNet for classification and MSCOCO for object detection. Codes are available at https://github.com/Iconip2022/ARCQ .

  • Research Article
  • 10.1007/s43684-025-00121-0
H-ViT: hardware-friendly post-training quantization for efficient vision transformer inference
  • Dec 23, 2025
  • Autonomous Intelligent Systems
  • Jing Liu + 4 more

Vision Transformers (ViTs) have achieved state-of-the-art performance on various computer vision tasks. However, these models are memory-consuming and computation-intensive, making their deployment and efficient inference on edge devices challenging. Model quantization is a promising approach to reduce model complexity. Prior works have explored tailored quantization algorithms for ViTs but unfortunately retained floating-point (FP) scaling factors, which not only yield non-negligible re-quantization overhead, but also hinder the quantized models from performing efficient integer-only inference. In this paper, we propose H-ViT, a dedicated post-training quantization scheme (e.g., symmetric uniform quantization and layer-wise quantization for both weights and part of activations) to effectively quantize ViTs with fewer Power-of-Two (PoT) scaling factors, thus minimizing the re-quantization overhead and memory consumption. In addition, observing serious inter-channel variation in LayerNorm inputs and outputs, we propose Power-of-Two quantization (PTQ), a systematic method to reduce the performance degradation without hyper-parameters. Extensive experiments are conducted on multiple vision tasks with different model variants, proving that H-ViT offers comparable (or even slightly higher) INT8 quantization performance with PoT scaling factors when compared to the counterpart with floating-point scaling factors. For instance, we reach 78.43 top-1 accuracy with DeiT-S on ImageNet, and 51.6 box AP and 44.8 mask AP with Cascade Mask R-CNN (Swin-B) on COCO.
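The Power-of-Two scaling idea is easy to sketch: rounding a floating-point scale to the nearest power of two lets re-quantization be implemented as a bit shift instead of a floating-point multiply. The function names below are illustrative, not H-ViT's API.

```python
import numpy as np

def pot_scale(scale):
    """Round a positive floating-point scaling factor to the nearest power of two."""
    return float(2.0 ** np.round(np.log2(scale)))

def quantize_int8(x, scale):
    """Symmetric INT8 quantization with the given scale."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)
```

For example, a learned scale of 0.1 becomes 2⁻³ = 0.125, so dividing by the scale amounts to a left shift by 3 in fixed-point arithmetic.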

  • Conference Article
  • Citations: 29
  • 10.1145/3503161.3547826
Towards Accurate Post-Training Quantization for Vision Transformer
  • Oct 10, 2022
  • Yifu Ding + 6 more

Vision transformer emerges as a potential architecture for vision tasks. However, the intense computation and non-negligible delay hinder its application in the real world. As a widespread model compression technique, existing post-training quantization methods still cause severe performance drops. We find the main reasons lie in (1) the existing calibration metric is inaccurate in measuring the quantization influence for extremely low-bit representation, and (2) the existing quantization paradigm is unfriendly to the power-law distribution of Softmax. Based on these observations, we propose a novel Accurate Post-training Quantization framework for Vision Transformer, namely APQ-ViT. We first present a unified Bottom-elimination Blockwise Calibration scheme to optimize the calibration metric to perceive the overall quantization disturbance in a blockwise manner and prioritize the crucial quantization errors that influence more on the final output. Then, we design a Matthew-effect Preserving Quantization for Softmax to maintain the power-law character and keep the function of the attention mechanism. Comprehensive experiments on large-scale classification and detection datasets demonstrate that our APQ-ViT surpasses the existing post-training quantization methods by convincing margins, especially in lower bit-width settings (e.g., on average up to 5.17% improvement for classification and 24.43% for detection on W4A4). We also highlight that APQ-ViT enjoys versatility and works well on diverse transformer variants.

  • Book Chapter
  • Citations: 100
  • 10.1007/978-3-030-58536-5_5
Post-training Piecewise Linear Quantization for Deep Neural Networks
  • Jan 1, 2020
  • Jun Fang + 5 more

Quantization plays an important role in the energy-efficient deployment of deep neural networks on resource-limited devices. Post-training quantization is highly desirable since it does not require retraining or access to the full training dataset. The well-established uniform scheme for post-training quantization achieves satisfactory results by converting neural networks from full-precision to 8-bit fixed-point integers. However, it suffers from significant performance degradation when quantizing to lower bit-widths. In this paper, we propose a piecewise linear quantization (PWLQ) scheme (Code will be made available at https://github.com/jun-fang/PWLQ ) to enable accurate approximation for tensor values that have bell-shaped distributions with long tails. Our approach breaks the entire quantization range into non-overlapping regions for each tensor, with each region being assigned an equal number of quantization levels. Optimal breakpoints that divide the entire range are found by minimizing the quantization error. Compared to state-of-the-art post-training quantization methods, experimental results show that our proposed method achieves superior performance on image classification, semantic segmentation, and object detection with minor overhead.
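A toy version of the breakpoint search on the magnitudes of a bell-shaped tensor is sketched below. It is a simplification, not the paper's optimal-breakpoint derivation: a plain grid search over a two-region split, with the sign assumed to be stored separately.

```python
import numpy as np

def region_quantize(a, lo, hi, levels):
    """Uniform quantization of values in [lo, hi] using `levels` levels."""
    step = (hi - lo) / (levels - 1)
    return np.round((np.clip(a, lo, hi) - lo) / step) * step + lo

def pwlq_breakpoint(x, n_bits=4):
    """Grid-search a breakpoint splitting [0, max|x|] into two regions with an
    equal number of levels each, minimizing overall MSE on the magnitudes."""
    a = np.abs(x)
    m = float(a.max())
    levels = 2 ** (n_bits - 1)          # per region; sign handled separately
    best_bp, best_mse = m, np.inf
    for frac in np.linspace(0.1, 0.9, 17):
        bp = frac * m
        q = np.where(a <= bp,
                     region_quantize(a, 0.0, bp, levels),
                     region_quantize(a, bp, m, levels))
        mse = float(np.mean((a - q) ** 2))
        if mse < best_mse:
            best_bp, best_mse = bp, mse
    return best_bp, best_mse
```

For a bell-shaped tensor, most of the mass sits in the inner region, so giving that region its own fine-grained levels beats a single uniform grid with the same total number of levels.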

  • Research Article
  • Citations: 16
  • 10.1016/j.future.2022.02.005
Quantune: Post-training quantization of convolutional neural networks using extreme gradient boosting for fast deployment
  • Feb 17, 2022
  • Future Generation Computer Systems
  • Jemin Lee + 3 more

