MGRQ: Post-Training Quantization For Vision Transformer With Mixed Granularity Reconstruction
Post-training quantization (PTQ) efficiently compresses vision models, but it is typically accompanied by some degree of accuracy degradation. Reconstruction methods aim to enhance model performance by narrowing the gap between the quantized model and the full-precision model, often yielding promising results. However, efforts to significantly improve the performance of PTQ through reconstruction on Vision Transformers (ViTs) have shown limited efficacy. In this paper, we conduct a thorough analysis of the reasons for this limited effectiveness and propose MGRQ (Mixed Granularity Reconstruction Quantization) to address the issue. Unlike previous reconstruction schemes, MGRQ introduces a mixed granularity reconstruction approach. Specifically, building upon Optimized Block-wise Reconstruction, MGRQ enhances the performance of PTQ by introducing Extra-Block Global Supervision and Intra-Block Local Supervision. Extra-Block Global Supervision considers the relationship between block outputs and the model's final output, aiding block-wise reconstruction through global supervision. Meanwhile, Intra-Block Local Supervision reduces generalization error by aligning the distribution of outputs at each layer within a block. The reconstruction is then further optimized through Mixed Granularity Loss Fusion. Extensive experiments on various ViT models illustrate the effectiveness of MGRQ. Notably, MGRQ demonstrates robust performance under low-bit quantization, enhancing the practicality of the quantized model.
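To make the mixed-granularity idea concrete, here is a minimal PyTorch sketch of how the three supervision signals could be fused into one reconstruction objective; the weighting factors (lambda_g, lambda_l) and the statistics-matching form of the local term are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of mixed-granularity loss fusion (assumed form, not MGRQ's exact losses).
import torch
import torch.nn.functional as F

def mixed_granularity_loss(q_block_out, fp_block_out,
                           q_logits, fp_logits,
                           q_layer_outs, fp_layer_outs,
                           lambda_g=0.1, lambda_l=0.01):
    # Block-wise reconstruction: match the quantized block's output to FP.
    block_loss = F.mse_loss(q_block_out, fp_block_out)

    # Extra-block global supervision: align final predictions via KL divergence.
    global_loss = F.kl_div(F.log_softmax(q_logits, dim=-1),
                           F.softmax(fp_logits, dim=-1),
                           reduction="batchmean")

    # Intra-block local supervision: align each layer's output statistics
    # (a hypothetical mean/variance matching stand-in for distribution alignment).
    local_loss = sum(
        F.mse_loss(q.mean(dim=0), f.mean(dim=0)) +
        F.mse_loss(q.var(dim=0), f.var(dim=0))
        for q, f in zip(q_layer_outs, fp_layer_outs)
    ) / len(q_layer_outs)

    return block_loss + lambda_g * global_loss + lambda_l * local_loss
```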
- Conference Article
13
- 10.1109/iccrd54409.2022.9730411
- Jan 7, 2022
In recent years, deploying neural networks to target environments has been considered a challenging task, especially because of the heavy burden DNN models place on computation capabilities and power consumption. Various methods based on post-training quantization are powerful tools for accelerating the optimization and refinement of deep neural networks. For low-power edge devices, such as the GNA neural coprocessor, quantization becomes the only way to make deployment possible. This article summarizes research on and improvements to neural network models and the parameters of various post-training quantization methods, and it analyzes and classifies the different approaches. First, the article reviews recent publications in the post-training quantization literature and analyzes the most effective cutting-edge algorithms in the field of deep neural networks. Second, advanced research based on the AdaQuant algorithm, integer quantization analysis, and data-free quantization is discussed; these techniques can reduce model size and simplify computational complexity. Moreover, the model accuracy and development potential of post-training quantization for deep neural networks are discussed further. The analysis shows the characteristics of different algorithms and the corresponding modified networks. How model characteristics are affected by the parameters, as well as the future research directions and difficulties faced by post-training quantization methods and technology, are also covered in this article.
- Research Article
- 10.1109/tpami.2025.3554523
- Jul 1, 2025
- IEEE transactions on pattern analysis and machine intelligence
Recently, post-training quantization (PTQ) has become the de facto way to produce efficient low-precision neural networks without lengthy retraining. Despite its low cost, current PTQ methods fail under extremely low-bit settings. In this work, we delve into extremely low-bit quantization and construct a unified theoretical analysis, which provides an in-depth understanding of why low-bit quantization fails. Based on this theoretical study, we argue that existing methods fail in low-bit schemes due to significant weight perturbation and a lack of consideration of activation quantization. To this end, we propose Brecq and QDrop to respectively solve these two challenges, on which a Q-Limit framework is built. The Q-Limit framework is then further extended to support a mixed-precision quantization scheme. To the best of our knowledge, this is the first work that can push the limit of PTQ down to INT2. Extensive experiments on various handcrafted and searched neural architectures are conducted for both visual recognition/detection tasks and language processing tasks. Without bells and whistles, our PTQ framework can attain low-bit ResNet and MobileNetV2 models comparable with quantization-aware training (QAT), establishing a new state of the art for PTQ.
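The QDrop idea referenced above admits a compact sketch: during calibration, each activation element is randomly left in full precision rather than quantized, exposing the reconstruction to the flatness induced by activation quantization. The drop probability p=0.5 is a common default and an assumption here, not necessarily this paper's setting.

```python
# Minimal PyTorch sketch of QDrop-style random activation-quantization dropping.
import torch

def fake_quantize(x, scale, zero_point, qmin=-128, qmax=127):
    # Simulated (fake) uniform quantization: quantize then dequantize.
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

def qdrop_activation(x, scale, zero_point, p=0.5, training=True):
    if not training:
        return fake_quantize(x, scale, zero_point)
    # Element-wise: keep the full-precision value with probability p,
    # otherwise use its quantized counterpart.
    mask = torch.rand_like(x) < p
    return torch.where(mask, x, fake_quantize(x, scale, zero_point))
```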
- Research Article
10
- 10.1007/s11227-024-05929-w
- Feb 20, 2024
- The Journal of Supercomputing
This document addresses some inherent problems in Machine Learning (ML), such as the high computational and energy costs associated with deploying ML models on IoT devices. It aims to study and analyze the performance and efficiency of quantization as an optimization method, as well as the possibility of training ML models directly on an IoT device. Quantization involves reducing the precision of model weights and activations while still maintaining acceptable levels of accuracy. Using representative networks for facial recognition developed with TensorFlow and TensorRT, Post-Training Quantization and Quantization-Aware Training are employed to reduce computational load and improve energy efficiency. The computational experiments were conducted on a general-purpose computer featuring an Intel i7-1260P processor and an NVIDIA RTX 3080 graphics card used as an accelerator. Additionally, an NVIDIA Jetson AGX Orin was used as an example of an IoT device. We analyze the feasibility of training on an IoT device and the impact of quantization on models trained via knowledge transfer, and we evaluate the differences between Post-Training Quantization and Quantization-Aware Training in such networks on different devices. Furthermore, the performance and efficiency of NVIDIA's inference accelerator (Deep Learning Accelerator, DLA, version 2.0) available in the Jetson Orin architecture are studied. We conclude that the Jetson device is capable of performing training on its own. Thanks to the optimization process, the IoT device can achieve inference performance similar to that of the more powerful processor, with better energy efficiency. Post-Training Quantization showed better performance, while Quantization-Aware Training demonstrated higher energy efficiency. However, since the accelerator cannot execute certain layers of the models, the use of the DLA worsens both the performance and the efficiency results.
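For reference, the Post-Training Quantization route evaluated in this study maps onto TensorFlow's standard TFLite conversion path; the sketch below assumes a trained Keras model, and representative_data is a placeholder for real calibration images.

```python
# Minimal sketch of TensorFlow post-training (full-integer) quantization.
import tensorflow as tf

def quantize_ptq(model, representative_data):
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # Calibration samples let the converter pick int8 activation ranges.
    converter.representative_dataset = lambda: (
        [tf.cast(x[None, ...], tf.float32)] for x in representative_data
    )
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8
    return converter.convert()  # serialized int8 TFLite model (bytes)
```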
- Research Article
3
- 10.1016/j.neunet.2025.107558
- Sep 1, 2025
- Neural networks : the official journal of the International Neural Network Society
Progressive fine-to-coarse reconstruction for accurate low-bit post-training quantization in vision transformers.
- Research Article
4
- 10.1016/j.neucom.2023.127120
- Dec 12, 2023
- Neurocomputing
Stabilized activation scale estimation for precise Post-Training Quantization
- Research Article
1
- 10.30871/jaic.v9i4.9700
- Aug 3, 2025
- Journal of Applied Informatics and Computing
Neural Processing Units (NPUs) are dedicated accelerators designed to perform efficient deep learning inference on edge devices with limited computational and power resources. In real-time applications such as automated parking systems, accurate and low-latency license plate recognition is critical. This study evaluates the effectiveness of quantization techniques, specifically Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT), in improving the performance of YOLOv8-based license plate detection models deployed on an Intel NPU integrated within the Core Ultra 7 155H processor. Three model configurations are compared: a full-precision float32 model, a PTQ model, and a QAT model. All models are converted to OpenVINO’s Intermediate Representation (IR) and benchmarked using the benchmark_app tool. Results show that PTQ and QAT significantly enhance inference efficiency. QAT achieves up to 39.9% improvement in throughput and 28.6% reduction in latency compared to the non-quantized model, while maintaining higher detection accuracy. Both quantized models also reduce model size by nearly 50 percent. Although PTQ is simpler to implement, QAT offers a better balance between accuracy and speed, making it more suitable for deployment in edge scenarios with real-time constraints. These findings highlight QAT as an optimal strategy for efficient and accurate license plate recognition on NPU-based edge platforms.
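The PTQ step in an OpenVINO deployment like this one is typically done with NNCF; a minimal sketch follows, with the IR path and the randomly generated calibration images as placeholders.

```python
# Minimal sketch of OpenVINO/NNCF post-training quantization (paths are placeholders).
import numpy as np
import nncf
import openvino as ov

core = ov.Core()
model = core.read_model("yolov8n.xml")  # placeholder IR path

# Calibration data: a handful of preprocessed images shaped for the model input.
calibration_images = [np.random.rand(1, 3, 640, 640).astype(np.float32)
                      for _ in range(32)]
calibration = nncf.Dataset(calibration_images)

quantized = nncf.quantize(model, calibration)
ov.save_model(quantized, "yolov8n_int8.xml")
# Benchmarked as in the study: benchmark_app -m yolov8n_int8.xml -d NPU
```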
- Book Chapter
16
- 10.1007/978-3-031-20083-0_39
- Jan 1, 2022
Quantization is a very effective optimization technique to reduce hardware cost and memory footprint of deep neural network (DNN) accelerators. In particular, post-training quantization (PTQ) is often preferred as it does not require a full dataset or costly retraining. However, performance of PTQ lags significantly behind that of quantization-aware training, especially for low-precision networks (≤4-bit). In this paper we propose a novel PTQ scheme (Code will be publicly available at https://github.com/sogh5/SubsetQ) to bridge the gap, with minimal impact on hardware cost. The main idea of our scheme is to increase arithmetic precision while retaining the same representational precision. The excess arithmetic precision enables us to better match the input data distribution while also presenting a new optimization problem, to which we propose a novel search-based solution. Our scheme is based on logarithmic-scale quantization, which can help reduce hardware cost through the use of shifters instead of multipliers. Our evaluation results using various DNN models on challenging computer vision tasks (image classification, object detection, semantic segmentation) show superior accuracy compared with the state-of-the-art PTQ methods at various low-bit precisions.
Keywords: Deep neural networks; Logarithmic-scale quantization; Post-training quantization; Subset quantization
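For intuition, here is a minimal sketch of the underlying logarithmic-scale quantizer (values snapped to signed powers of two, so multiplies become shifts); the exponent window and level allocation are illustrative, and the paper's subset search on top of this base scheme is omitted.

```python
# Minimal sketch of power-of-two (logarithmic-scale) quantization.
import torch

def log2_quantize(x, n_bits=4):
    sign = torch.sign(x)
    mag = torch.abs(x).clamp(min=1e-12)
    e = torch.round(torch.log2(mag))
    # Illustrative level allocation: a window of ~2^(n_bits-1) exponents
    # just below the tensor's maximum, plus sign and zero.
    e_max = torch.floor(torch.log2(mag.max()))
    e_min = e_max - (2 ** (n_bits - 1) - 1)
    e = torch.clamp(e, e_min, e_max)
    q = sign * torch.pow(2.0, e)
    # Values far below the representable window collapse to zero.
    return torch.where(torch.abs(x) < 2.0 ** (e_min - 1),
                       torch.zeros_like(q), q)
```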
- Conference Article
5
- 10.1109/conf-spml54095.2021.00059
- Nov 1, 2021
As model prediction becomes more and more accurate and networks become deeper and deeper, the amount of memory consumed by a neural network becomes a problem, especially on mobile devices. It is also very difficult to balance the tradeoff between computational cost and battery life, which makes it hard for mobile devices to become smarter. Model quantization techniques provide an opportunity to tackle this tradeoff by reducing memory bandwidth and storage and improving system throughput and latency. This paper discusses and compares state-of-the-art neural network quantization methodologies, including Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). PTQ directly quantizes the trained floating-point model; the implementation process is simple and does not require quantization during the training phase. QAT inserts simulated quantization operations to model the effect of quantization, while forward and backward passes are still performed in floating point. Finally, based on the experiments discussed in this paper, we conclude that with the evolution of quantization techniques, the accuracy gap between PTQ and QAT is shrinking.
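The "simulated quantization operations" that QAT inserts can be sketched as a fake-quantization autograd function with a straight-through estimator, which passes gradients through the non-differentiable rounding step; this is a generic sketch, not the specific implementation compared in the paper.

```python
# Minimal sketch of fake quantization with a straight-through estimator (STE).
import torch

class FakeQuantSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale, qmin=-128, qmax=127):
        # Quantize-dequantize so downstream layers see int8-like values.
        return torch.clamp(torch.round(x / scale), qmin, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        # Straight-through: treat round/clamp as identity for gradients.
        return grad_out, None, None, None
```

Usage inside a training forward pass is simply `y = FakeQuantSTE.apply(x, scale)`.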
- Conference Article
2
- 10.1109/3ict53449.2021.9581835
- Sep 29, 2021
The gradual advancement of Deep Neural Networks has ultimately led to the development of specialized hardware for running such networks with superior performance. However, much of the consumer-grade hardware and many Single Board Computers (SBCs) used in embedded scenarios are not (yet) computationally efficient enough to execute even an already trained model within feasible limits. In this paper, we therefore focus on applying Post-Training Quantization (PTQ) strategies to mitigate the great computational demands arising from the evolution of complex models. As the title implies, the original model incorporates a Generative Adversarial Network to generate cartoonized versions of provided real-world input images. This model, in its original state, takes nearly twice as much time to process the output on a single-threaded workload. Our solution to the above-stated issue involves quantizing the pre-trained model from 32-bit floating-point values to a minimum of 8-bit integer values, combined with transfer learning in general. Our test results show that PTQ allowed the model to be compressed to a smaller size compared to the original, making it ready to be deployed in resource-constrained environments. In addition, a significant increase in the inference engine's processing performance has also been observed on general-purpose hardware.
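As an illustration of the float32-to-int8 conversion described above, here is a minimal static post-training quantization sketch using PyTorch's eager-mode API on a stand-in module; the original work quantized its own GAN generator, which this does not reproduce.

```python
# Minimal sketch of static int8 PTQ in PyTorch eager mode (stand-in model).
import torch
import torch.nn as nn

class Small(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.conv = nn.Conv2d(3, 8, 3)
        self.relu = nn.ReLU()
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))

m = Small().eval()
m.qconfig = torch.ao.quantization.get_default_qconfig("fbgemm")
prepared = torch.ao.quantization.prepare(m)
for _ in range(8):                          # calibration pass (dummy data here)
    prepared(torch.randn(1, 3, 32, 32))
int8_model = torch.ao.quantization.convert(prepared)  # int8 weights/activations
```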
- Book Chapter
144
- 10.1007/978-3-030-58536-5_5
- Jan 1, 2020
Quantization plays an important role in the energy-efficient deployment of deep neural networks on resource-limited devices. Post-training quantization is highly desirable since it does not require retraining or access to the full training dataset. The well-established uniform scheme for post-training quantization achieves satisfactory results by converting neural networks from full-precision to 8-bit fixed-point integers. However, it suffers from significant performance degradation when quantizing to lower bit-widths. In this paper, we propose a piecewise linear quantization (PWLQ) scheme (Code will be made available at https://github.com/jun-fang/PWLQ) to enable accurate approximation for tensor values that have bell-shaped distributions with long tails. Our approach breaks the entire quantization range into non-overlapping regions for each tensor, with each region being assigned an equal number of quantization levels. Optimal breakpoints that divide the entire range are found by minimizing the quantization error. Compared to state-of-the-art post-training quantization methods, experimental results show that our proposed method achieves superior performance on image classification, semantic segmentation, and object detection with minor overhead.
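A simplified sketch of the piecewise linear idea: split the tensor's range at a breakpoint, give each region its own uniform grid with an equal number of levels, and keep the breakpoint that minimizes quantization error. This symmetric two-region version with a grid search over candidate breakpoints is an illustration; PWLQ's actual scheme is more general.

```python
# Minimal sketch of piecewise linear quantization with breakpoint search.
import torch

def uniform_q(x, lo, hi, n_levels):
    scale = (hi - lo) / (n_levels - 1)
    return torch.clamp(torch.round((x - lo) / scale), 0, n_levels - 1) * scale + lo

def pwlq(x, n_bits=4, candidates=torch.linspace(0.1, 0.9, 9)):
    m = x.abs().max().clamp(min=1e-8)
    n_levels = 2 ** (n_bits - 1)   # equal level count per region (illustrative)
    best, best_err = None, float("inf")
    for frac in candidates:
        bp = frac * m              # candidate breakpoint
        inner = uniform_q(x.clamp(-bp, bp), -bp, bp, n_levels)
        outer = torch.sign(x) * uniform_q(x.abs().clamp(bp, m), bp, m, n_levels)
        q = torch.where(x.abs() <= bp, inner, outer)
        err = torch.mean((x - q) ** 2)
        if err < best_err:         # keep the breakpoint with least MSE
            best, best_err = q, err
    return best
```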
- Research Article
7
- 10.3390/info16050348
- Apr 25, 2025
- Information
The deployment of machine learning models on mobile platforms has ushered in a new era of innovation across diverse sectors, including agriculture, where such applications hold immense promise for empowering farmers with cutting-edge technologies. In this context, the threat posed by insects to crop yields during harvest has escalated, fueled by factors such as evolution and climate change-induced shifts in insect behavior. To address this challenge, smart insect monitoring systems and detection models have emerged as crucial tools for farmers and IoT-based systems, enabling interventions to safeguard crops. The primary contribution of this study lies in its systematic investigation of model optimization techniques for edge deployment, including Post-Training Quantization, Quantization-Aware Training, and Data Representative Quantization. As such, we address the crucial need for efficient, on-site pest detection tools in agricultural settings. We provide a detailed analysis of the trade-offs between model size, inference speed, and accuracy across different optimization approaches, ensuring practical applicability in resource-constrained farming environments. Our study explores various methodologies for model development, including the utilization of Mobile-ViT and EfficientNet architectures, coupled with transfer learning and fine-tuning techniques. Using the Dangerous Farm Insects Dataset, we achieve an accuracy of 82.6% and 77.8% on validation and test datasets, respectively, showcasing the efficacy of our approach. Furthermore, we investigate quantization techniques to optimize model performance for on-device inference, ensuring seamless deployment on mobile devices and other edge devices without compromising accuracy. The best quantized model, produced through Post-Training Quantization, was able to maintain a classification accuracy of 77.8% while significantly reducing the model size from 33 MB to 9.6 MB. To validate the generalizability of our solution, we extended our experiments to the larger IP102 dataset. The quantized model produced using Post-Training Quantization was able to maintain a classification accuracy of 59.6% while also reducing the model size from 33 MB to 9.6 MB, thus demonstrating that our solution maintains a competitive performance across a broader range of insect classes.
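The Quantization-Aware Training route compared in this study corresponds to the TensorFlow Model Optimization toolkit's quantize_model wrapper; the sketch below assumes a trained Keras classifier, with the dataset and epoch count as placeholders.

```python
# Minimal sketch of QAT finetuning with the TF Model Optimization toolkit.
import tensorflow as tf
import tensorflow_model_optimization as tfmot

def quantize_aware_finetune(model, train_ds, epochs=3):
    # Wrap the model with fake-quantization ops so training sees int8 effects.
    q_model = tfmot.quantization.keras.quantize_model(model)
    q_model.compile(optimizer="adam",
                    loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])
    q_model.fit(train_ds, epochs=epochs)
    return q_model  # convert with TFLiteConverter afterwards for deployment
```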
- Book Chapter
4
- 10.1007/978-3-031-25082-8_8
- Jan 1, 2023
The post-training quantization (PTQ) challenge of bringing quantized neural network accuracy close to the original has drawn much attention, driven by industry demand. Many methods emphasize optimization of a specific per-layer degree of freedom (DoF), such as the grid step size, preconditioning factors, or nudges to weights and biases, often chained to others in multi-step solutions. Here we rethink quantized network parameterization in a HW-aware fashion, towards a unified analysis of all quantization DoF that permits, for the first time, their joint end-to-end finetuning. Our simple and extendable single-step method, dubbed quantization-aware finetuning (QFT), achieves 4-bit weight quantization results on par with the SoTA within PTQ constraints on speed and resources.
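The joint finetuning of quantization degrees of freedom can be sketched by exposing a quantizer's parameters, here just a per-tensor step size, as learnable tensors, in the style of learned step size quantization (LSQ); QFT's actual HW-aware parameterization covers more DoF than this illustration.

```python
# Minimal sketch of a fake quantizer with a learnable step size (LSQ-style).
import torch
import torch.nn as nn

class LearnableFakeQuant(nn.Module):
    def __init__(self, init_scale=0.1, qmin=-8, qmax=7):
        super().__init__()
        self.scale = nn.Parameter(torch.tensor(init_scale))
        self.qmin, self.qmax = qmin, qmax

    def forward(self, x):
        s = self.scale.abs() + 1e-9
        q = torch.clamp(x / s, self.qmin, self.qmax)
        # Round with a straight-through estimator so gradients reach
        # both the input and the learnable step size.
        q = (torch.round(q) - q).detach() + q
        return q * s
```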
- Research Article
8
- 10.1186/s13677-024-00630-y
- Mar 18, 2024
- Journal of Cloud Computing
Bitcoin exchange security is crucial because of the widespread use of mobile edge computing (MEC). Cryptojacking has compromised MEC application security and the functionality of the bitcoin exchange ecosystem. This paper proposes a cutting-edge neural network with the AdaHessian optimization technique for cryptojacking prediction and defense. We provide a deep neural network (DNN) approach to cryptojacking attack prediction that employs pruning, post-training quantization, and AdaHessian optimization. A new framework for fast DNN training using AdaHessian optimization can detect cryptojacking attempts at reduced computational cost, while pruning and post-training quantization adapt the model for low-CPU edge devices. The proposed approach drastically decreases the number of model parameters without affecting cryptojacking attack prediction. The model achieves a Recall of 98.72%, Precision of 98.91%, F1-Score of 99.09%, MSE of 0.0140, RMSE of 0.0137, and MAE of 0.0139. Our solution beats state-of-the-art approaches in precision, computational efficiency, and resource consumption, enabling more realistic, trustworthy, and cost-effective machine learning models. We address growing cybersecurity issues holistically by closing the DNN optimization-security loop, delivering scalable and efficient cryptojacking protection for crypto exchange operations and improving machine learning, cybersecurity, and network management.
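A minimal sketch of the pruning plus post-training quantization pipeline described above, using standard PyTorch utilities on a stand-in classifier; the AdaHessian training step is omitted and the architecture is a placeholder.

```python
# Minimal sketch: magnitude pruning followed by int8 post-training quantization.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))

# Magnitude pruning: zero out 50% of each Linear layer's smallest weights.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # make the pruning permanent

# Post-training (dynamic) quantization of the pruned model to int8.
model_int8 = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```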
- Conference Article
51
- 10.1145/3503161.3547826
- Oct 10, 2022
Vision transformer emerges as a potential architecture for vision tasks. However, the intense computation and non-negligible delay hinder its application in the real world. As a widespread model compression technique, existing post-training quantization methods still cause severe performance drops. We find the main reasons lie in (1) the existing calibration metric is inaccurate in measuring the quantization influence for extremely low-bit representation, and (2) the existing quantization paradigm is unfriendly to the power-law distribution of Softmax. Based on these observations, we propose a novel Accurate Post-training Quantization framework for Vision Transformer, namely APQ-ViT. We first present a unified Bottom-elimination Blockwise Calibration scheme to optimize the calibration metric to perceive the overall quantization disturbance in a blockwise manner and prioritize the crucial quantization errors that influence more on the final output. Then, we design a Matthew-effect Preserving Quantization for Softmax to maintain the power-law character and keep the function of the attention mechanism. Comprehensive experiments on large-scale classification and detection datasets demonstrate that our APQ-ViT surpasses the existing post-training quantization methods by convincing margins, especially in lower bit-width settings (e.g., averagely up to 5.17% improvement for classification and 24.43% for detection on W4A4). We also highlight that APQ-ViT enjoys versatility and works well on diverse transformer variants.
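The motivation for Matthew-effect Preserving Quantization can be sketched by contrasting a uniform grid with a log-domain grid on softmax outputs; the log2 quantizer below follows the approach common in related work on ViT quantization and is not APQ-ViT's exact formulation.

```python
# Minimal sketch: uniform vs. log2 quantization of softmax attention weights.
import torch

def uniform_q_softmax(p, n_bits=4):
    # Uniform grid on [0, 1]: most tiny attention weights collapse to 0.
    levels = 2 ** n_bits - 1
    return torch.round(p * levels) / levels

def log2_q_softmax(p, n_bits=4):
    # Log-domain grid: small probabilities keep distinct codes, so the
    # long tail (power-law character) of attention survives quantization.
    q = torch.clamp(torch.round(-torch.log2(p.clamp(min=1e-12))),
                    0, 2 ** n_bits - 1)
    return torch.pow(2.0, -q)
```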
- Research Article
17
- 10.1016/j.future.2022.02.005
- Feb 17, 2022
- Future Generation Computer Systems
Quantune: Post-training quantization of convolutional neural networks using extreme gradient boosting for fast deployment