MADTP++: Bridge the Gap Between Token and Weight Pruning for Accelerating VLTs.

TL;DR

MADTP++ introduces a unified framework combining multi-modality aligned token pruning and hardware-aware weight pruning to compress Vision-Language Transformers, achieving significant reductions in parameters and GFLOPs while maintaining competitive performance through dynamic token elimination, fine-grained pruning, and joint optimization strategies.

Abstract

Vision-Language Transformers (VLTs) have achieved remarkable success, yet their high computational costs remain challenging due to numerous input tokens and large model parameters. Existing VLT compression methods primarily rely on single-modality-based token pruning or coarse-grained weight pruning techniques. However, these methods face significant obstacles, such as ignoring the critical alignment of different modalities and lacking layer-wise dynamic token pruning flexibility, exhibiting inevitable performance degradation due to coarse-grained weight pruning, and struggling with the simultaneous compression of both input tokens and model parameters. To address these limitations, we propose MADTP++, a novel approach that integrates custom-made token and weight pruning processes into a unified framework, achieving superior compression in both parameter counts and computational costs. Specifically, for the token pruning process, we introduce the Multi-modality Alignment Guidance (MAG) module and the Dynamic Token Pruning (DTP) module to align semantic features across different modalities and guide the dynamic elimination of redundant tokens based on different input instances. For the weight pruning process, we propose a Hardware-aware Weight Pruning (HWP) module that leverages the Sparse Tensor Cores across diverse hardware setups to enable fine-grained parameter pruning within VLTs. To further unify token and weight pruning, we also propose a Cooperative Optimization Training Strategy that automatically allocates GFLOPs and parameter reductions per branch before pruning and employs Knowledge Distillation Constraints to facilitate joint optimization of both pruning dimensions. Extensive experiments conducted on various VLT models and datasets demonstrate that MADTP++ can significantly reduce model parameters and computational costs while maintaining competitive performance.
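
A minimal sketch of the two pruning dimensions named in the abstract, under stated assumptions. It is not the MADTP++ implementation: the function names, the max-relative threshold rule for tokens, and the 2:4 pattern for weights (two nonzero values kept in every group of four, the fine-grained structured sparsity format that NVIDIA Sparse Tensor Cores accelerate) are illustrative choices that the abstract does not specify.

```python
def prune_tokens(scores, keep_ratio_of_max=0.5):
    """Return indices of tokens whose score is at least a fraction of the top score.

    Assumption: scores are non-negative importance values. The threshold is
    derived from the input itself, so the number of kept tokens varies per instance.
    """
    if not scores:
        return []
    threshold = keep_ratio_of_max * max(scores)
    return [i for i, s in enumerate(scores) if s >= threshold]


def prune_2_of_4(weights):
    """Zero the two smallest-magnitude weights in every group of four (2:4 sparsity)."""
    if len(weights) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    pruned = []
    for start in range(0, len(weights), 4):
        group = weights[start:start + 4]
        # Positions of the two largest magnitudes; on ties the earlier position wins.
        keep = set(sorted(range(4), key=lambda j: -abs(group[j]))[:2])
        pruned.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return pruned
```

Because the token threshold is relative to each input's top score, an input with one dominant token keeps few tokens, while an input with near-uniform scores keeps all of them.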

Similar Papers
  • Research Article
  • Citations: 18
  • 10.1016/j.neucom.2023.127189
A multi-granularity CNN pruning framework via deformable soft mask with joint training
  • Dec 29, 2023
  • Neurocomputing
  • Peng Zhang + 3 more

  • Research Article
  • Citations: 4
  • 10.1145/3447776
Dynamic Regularization on Activation Sparsity for Neural Network Efficiency Improvement
  • Jun 30, 2021
  • ACM Journal on Emerging Technologies in Computing Systems
  • Qing Yang + 3 more

When deploying deep neural networks in embedded systems, it is crucial to decrease the model size and computational complexity for improving the execution speed and efficiency. In addition to conventional compression techniques, e.g., weight pruning and quantization, removing unimportant activations can also dramatically reduce the amount of data communication and the computation cost. Unlike weight parameters, the pattern of activations is directly related to input data and thereby changes dynamically. To regulate the dynamic activation sparsity (DAS), in this work, we propose a generic low-cost approach based on winners-take-all (WTA) dropout technique. The network enhanced by the proposed WTA dropout, namely DASNet, features structured activation sparsity with an improved sparsity level. Compared to the static feature map pruning methods, DASNets provide better computation cost reduction. The WTA dropout technique can be easily applied in deep neural networks without incurring additional training variables. More importantly, DASNet can be seamlessly integrated with other compression techniques, such as weight pruning and quantization, without compromising accuracy. Our experiments on various networks and datasets present significant runtime speedups with negligible accuracy losses.
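
The winners-take-all idea in this abstract can be illustrated with a generic top-k mask. This is a sketch under assumptions, not the DASNet layer: the helper name `wta_mask` and the choice to rank raw activation values are hypothetical, and the abstract does not say how winners are selected.

```python
def wta_mask(activations, k):
    """Keep the k largest activations and zero the rest (winners-take-all)."""
    if k <= 0:
        return [0.0] * len(activations)
    if k >= len(activations):
        return list(activations)
    # Indices of the k largest values; on ties the earlier index wins.
    winners = set(sorted(range(len(activations)), key=lambda i: -activations[i])[:k])
    return [a if i in winners else 0.0 for i, a in enumerate(activations)]
```

The mask depends on the activations themselves, so the positions of the zeros change with every input, which matches the abstract's point that activation sparsity is dynamic.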

  • Research Article
  • Citations: 19
  • 10.3390/electronics8111321
Fast Convolutional Neural Networks in Low Density FPGAs Using Zero-Skipping and Weight Pruning
  • Nov 9, 2019
  • Electronics
  • Mário P Véstias + 3 more

Edge devices are becoming smarter with the integration of machine learning methods, such as deep learning, and are therefore used in many application domains where decisions have to be made without human intervention. Deep learning and, in particular, convolutional neural networks (CNN) are more efficient than previous algorithms for several computer vision applications such as security and surveillance, where image and video analysis are required. This better efficiency comes with a cost of high computation and memory requirements. Hence, running CNNs in embedded computing devices is a challenge for both algorithm and hardware designers. New processing devices, dedicated system architectures and optimization of the networks have been researched to deal with these computation requirements. In this paper, we improve the inference execution times of CNNs in low density FPGAs (Field-Programmable Gate Arrays) using fixed-point arithmetic, zero-skipping and weight pruning. The developed architecture supports the execution of large CNNs in FPGA devices with reduced on-chip memory and computing resources. With the proposed architecture, it is possible to infer an image in AlexNet in 2.9 ms in a ZYNQ7020 and 1.0 ms in a ZYNQ7045 with less than 1% accuracy degradation. These results improve previous state-of-the-art architectures for CNN inference.

  • Dissertation
  • 10.12794/metadc2443127
RingChains Graph-based Summarizer and Enhanced Large Language Models for Summarizing Long Documents
  • May 1, 2025
  • Tam Cong Doan

Large language models (LLMs) have influenced real-world applications since ChatGPT appeared. Although powerful LLMs produce high-quality summaries, summarizing long documents remains challenging for them. First, LLMs must process a large number of unimportant input tokens, and they perform billions of operations per input token because of their complicated architecture and large model sizes. Second, most standard LLMs have a limited context window size. If the number of context tokens is increased by a factor of n, both the required computational resources and the running time scale as n² in a Transformer architecture or as n√n in a sparse Transformer architecture. Third, using LLMs typically requires either an internet connection or high-performance local hardware. Fourth, LLMs need vast amounts of training data, and they still cannot entirely avoid hallucinations. Some real-world documents, such as classified files, cannot be used for training, cannot be uploaded to the internet, and cannot tolerate hallucinations. Moreover, approximately one billion people worldwide own computers but lack internet access. These individuals have already been left behind in the internet revolution. We must ensure they are not left behind again in the AI revolution. This dissertation proposes the RingChains topology graph-based summarizer, which can be implemented to work on any computer. It offers fast execution, unlimited input tokens, high-quality summaries, no training process, and no hallucinations. RingChains processed 500 government reports from the zeroSCROLL dataset in 22.06 seconds, whereas GPT40 took 5,749.04 seconds, and both models achieved almost the same level of accuracy. RingChains is particularly suited for domains like classified documents and can help people who have computers but no internet connection participate in the AI revolution.
This dissertation also presents RingChains_LLMs, a system that significantly reduces computational resources, running time, and cost, and handles the limited input tokens of small-window LLMs, while avoiding the expensive process of adjusting the architecture or adding training steps to expand the context window size of LLMs. Users can obtain high-quality summaries comparable to those of powerful LLMs while greatly reducing both costs and running time by deploying the RingChains-LLMs system. Both the open-source user application of RingChains and RingChains_LLMs are available on my GitHub (tamdoancong/application).
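
The scaling claim in this abstract (quadratic growth for a standard Transformer, n√n for a sparse one) can be checked with simple arithmetic. The helper name below is hypothetical, and it models relative growth only, not absolute cost.

```python
import math


def cost_growth(factor, sparse=False):
    """Relative growth in compute when the context length grows by `factor`."""
    # Dense attention grows quadratically; the sparse variant grows as n * sqrt(n).
    return factor * math.sqrt(factor) if sparse else factor ** 2
```

For example, quadrupling the context multiplies the dense cost by 16 and the sparse cost by 8, so the gap between the two widens as documents get longer.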

  • Research Article
  • 10.1609/aaai.v40i31.39814
Explore and Establish Synergistic Effects Between Weight Pruning and Coreset Selection in Neural Network Training
  • Mar 14, 2026
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Weilin Wan + 4 more

Modern deep neural networks rely heavily on massive model weights and training samples, incurring substantial computational costs. Weight pruning and coreset selection are two emerging paradigms proposed to improve computational efficiency. In this paper, we first explore the interplay between redundant weights and training samples through a transparent analysis: redundant samples, particularly noisy ones, cause model weights to become unnecessarily overtuned to fit them, complicating the identification of irrelevant weights during pruning; conversely, irrelevant weights tend to overfit noisy data, undermining coreset selection effectiveness. To further investigate and harness this interplay in deep learning, we develop a Simultaneous Weight and Sample Tailoring mechanism (SWaST) that alternately performs weight pruning and coreset selection to establish a synergistic effect in training. During this investigation, we observe that when simultaneously removing a large number of weights and samples, a phenomenon we term critical double-loss can occur, where important weights and their supportive samples are mistakenly eliminated at the same time, leading to model instability and nearly irreversible degradation that cannot be recovered in subsequent training. Unlike classic machine learning models, this issue can arise in deep learning due to the lack of theoretical guarantees on the correctness of weight pruning and coreset selection, which explains why these paradigms are often developed independently. We mitigate this by integrating a state preservation mechanism into SWaST, enabling stable joint optimization. Extensive experiments reveal a strong synergy between pruning and coreset selection across varying prune rates and coreset sizes, delivering accuracy boosts of up to 17.83% alongside 10% to 90% FLOPs reductions.

  • Conference Article
  • Citations: 25
  • 10.1109/dac18074.2021.9586152
A Unified DNN Weight Pruning Framework Using Reweighted Optimization Methods
  • Dec 5, 2021
  • Tianyun Zhang + 6 more

To address the large model size and intensive computation requirement of deep neural networks (DNNs), weight pruning techniques have been proposed and generally fall into two categories, i.e., static regularization-based pruning and dynamic regularization-based pruning. However, the former currently suffers from either complex workloads or accuracy degradation, while the latter takes a long time to tune the parameters to achieve the desired pruning rate without accuracy loss. In this paper, we propose a unified DNN weight pruning framework with dynamically updated regularization terms bounded by the designated constraint. Our proposed method increases the compression rate, reduces the training time, and reduces the number of hyper-parameters compared with the state-of-the-art ADMM-based hard constraint method.

  • Conference Article
  • Citations: 59
  • 10.1109/islped.2019.8824944
An Ultra-Efficient Memristor-Based DNN Framework with Structured Weight Pruning and Quantization Using ADMM
  • Jul 1, 2019
  • Geng Yuan + 9 more

The high computation and memory storage of large deep neural network (DNN) models pose intensive challenges to the conventional Von-Neumann architecture, incurring substantial data movements in the memory hierarchy. The memristor crossbar array has emerged as a promising solution to mitigate the challenges and enable low-power acceleration of DNNs. Memristor-based weight pruning and weight quantization have been separately investigated and proven effective in reducing area and power consumption compared to the original DNN model. However, there has been no systematic investigation of memristor-based neuromorphic computing (NC) systems considering both weight pruning and weight quantization. In this paper, we propose a unified and systematic memristor-based framework considering both structured weight pruning and weight quantization by incorporating the alternating direction method of multipliers (ADMM) into DNN training. We consider hardware constraints such as crossbar block pruning, conductance range, and mismatch between weight values and real devices, to achieve high accuracy, low power, and a small area footprint. Our framework consists mainly of three steps, i.e., memristor-based ADMM regularized optimization, masked mapping, and retraining. Experimental results show that our proposed framework achieves a 29.81X (20.88X) weight compression ratio, with 98.38% (96.96%) and 98.29% (97.47%) power and area reduction on the VGG-16 (ResNet-18) network, with only 0.5% (0.76%) accuracy loss compared to the original DNN models. We share our models at link http://bit.ly/2Jp5LHJ.

  • Research Article
  • 10.54254/2755-2721/2025.ast26375
The Current Application Status and Prospects of Pruning Methods in natural language Processing
  • Sep 3, 2025
  • Applied and Computational Engineering
  • Tao Fang

The rapid development of Natural Language Processing (NLP), driven by large-scale pre-trained models like BERT and GPT, has led to surging model parameters and computational complexity, resulting in high resource consumption and slow inference speed. Pruning, as an efficient model compression method, can significantly improve inference efficiency while maintaining model performance by removing redundant parameters or structures, and thus has important application value in NLP. This paper systematically reviews the current application status of pruning methods in NLP, including traditional methods such as weight pruning, structured pruning (such as layer pruning, attention head pruning), and analyzes the practical effects and limitations of these methods in tasks such as text classification, machine translation, question-answering systems, etc. The research shows that pruning techniques can effectively reduce the storage and computational overhead of large models, but still face challenges in dynamic pruning, sparsity optimization, and cross-task generalization. In the future, hybrid approaches that combine adaptive pruning, knowledge distillation, and hardware-aware pruning will become an important research direction. In addition, exploring the impact of pruning on model interpretability and robustness, as well as its fit for multimodal tasks, will also be a focus of future research. This paper aims to provide theoretical references and technical guidance for efficient model design and practice in the field of NLP.

  • Conference Article
  • Citations: 18
  • 10.1109/fpl57034.2022.00013
Accurate, Low-latency, Efficient SAR Automatic Target Recognition on FPGA
  • Aug 1, 2022
  • Bingyi Zhang + 3 more

Synthetic aperture radar (SAR) automatic target recognition (ATR) is the key technique for remote-sensing image recognition. The state-of-the-art convolutional neural networks (CNNs) for SAR ATR suffer from high computation cost and large memory footprint, making them unsuitable to be deployed on resource-limited platforms, such as small/micro satellites. In this paper, we propose a comprehensive GNN-based model-architecture co-design on FPGA to address the above issues. Model design: we design a novel graph neural network (GNN) for SAR ATR. The proposed GNN model incorporates GraphSAGE layer operators and attention mechanism, achieving comparable accuracy as the state-of-the-art work with near 1/100 computation cost. Then, we propose a pruning approach including weight pruning and input pruning. While weight pruning through lasso regression reduces most parameters without accuracy drop, input pruning eliminates most input pixels with negligible accuracy drop. Architecture design: to fully unleash the computation parallelism within the proposed model, we develop a novel unified hardware architecture that can execute various computation kernels (feature aggregation, feature transformation, graph pooling). The proposed hardware design adopts the Scatter-Gather paradigm to efficiently handle the irregular computation patterns of various computation kernels. We deploy the proposed design on an embedded FPGA (AMD Xilinx ZCU104) and evaluate the performance using MSTAR dataset. Compared with the state-of-the-art CNNs, the proposed GNN achieves comparable accuracy with 1/3258 computation cost and 1/83 model size. Compared with the state-of-the-art CPU/GPU, our FPGA accelerator achieves 14.8×/2.5× speedup (latency) and is 62×/39× more energy efficient.

  • Research Article
  • Citations: 16
  • 10.1109/tcsi.2022.3184175
SWPU: A 126.04 TFLOPS/W Edge-Device Sparse DNN Training Processor With Dynamic Sub-Structured Weight Pruning
  • Oct 1, 2022
  • IEEE Transactions on Circuits and Systems I: Regular Papers
  • Yang Wang + 4 more

When deploying deep neural networks (DNNs), training on edge devices is a practical way to improve model adaptivity for various user-specific scenarios while avoiding privacy disclosure. However, the training computation is intolerable for edge devices. This has brought sparse DNN training (SDT), which reduces training computation through dynamic weight pruning, into the limelight. Generally, SDT has two strategies based on the pruning granularity: structured and unstructured. Unfortunately, both suffer from limited training efficiency due to the gap between pruning granularity and hardware implementation. The former is hardware-friendly but has a low pruning ratio, indicating limited computation reduction. The latter has a high pruning ratio, but the unbalanced workload decreases utilization and the irregular sparsity distribution causes considerable sparsity processing overhead. This paper proposes a software-hardware co-design to bridge the gap and improve the efficiency of SDT. On the algorithm side, a sub-structured pruning method, achieved with hybrid shape-wise and line-wise pruning, generates a high sparsity ratio and keeps the hardware-friendly property. On the hardware side, a sub-structured weight processing unit (SWPU) effectively handles the hybrid sparsity with three techniques. First, SWPU dynamically reorders the computation sequence with hamming-distance-based clustering, balancing the irregular workload. Second, SWPU performs runtime scheduling by exploiting the feature of sub-structured sparse convolution through a detect-before-load controller, which skips redundant memory access and sparsity processing. Third, SWPU performs sparse convolution by compressing operands with spatial disconnect log-based routing and recovers their location with bi-directional switching, avoiding the power-consumed routing logic. Synthesized with 28nm CMOS technology, SWPU can enable 0.56V-to-1.0V supply voltage with a maximum frequency of 675 MHz. It achieves a 50.1% higher pruning ratio than structured pruning and 1.53× higher energy efficiency than unstructured pruning. The peak energy efficiency of SWPU is 126.04 TFLOPS/W, outperforming the state-of-the-art training processor by 1.67×. When training a ResNet-18 model, SWPU reduces energy by 3.72× and offers a 4.69× speedup over previous sparse training processors.

  • Book Chapter
  • Citations: 6
  • 10.1007/978-3-031-20083-0_29
Multi-granularity Pruning for Model Acceleration on Mobile Devices
  • Jan 1, 2022
  • Tianli Zhao + 6 more

For practical deep neural network design on mobile devices, it is essential to consider the constraints incurred by the computational resources and the inference latency in various applications. Among deep network acceleration approaches, pruning is a widely adopted practice to balance the computational resource consumption and the accuracy, where unimportant connections can be removed either channel-wisely or randomly with a minimal impact on model accuracy. The coarse-grained channel pruning instantly results in a significant latency reduction, while the fine-grained weight pruning is more flexible to retain accuracy. In this paper, we present a unified framework for the Joint Channel pruning and Weight pruning, named JCW, which achieves a better pruning proportion between channel and weight pruning. To fully optimize the trade-off between latency and accuracy, we further develop a tailored multi-objective evolutionary algorithm in the JCW framework, which enables one single round search to obtain the accurate candidate architectures for various deployment requirements. Extensive experiments demonstrate that the JCW achieves a better trade-off between the latency and accuracy against previous state-of-the-art pruning methods on the ImageNet classification dataset.

  • Research Article
  • Citations: 26
  • 10.1049/iet-ipr.2018.6191
Thinning of convolutional neural network with mixed pruning
  • Mar 20, 2019
  • IET Image Processing
  • Wenzhu Yang + 5 more

Deep learning has achieved state‐of‐the‐art accuracy in many computer vision tasks. However, convolutional neural networks are difficult to deploy on resource-constrained devices because of the devices' limited computation power and memory space. Thus, it is necessary to prune the redundant weights and filters rationally and effectively. Considering that the pruned model still contains redundancy after weight pruning or filter pruning alone, a method combining weight pruning and filter pruning is proposed. First, filter pruning is performed: the least important filters are removed and fine‐tuning is used to recover the model's accuracy. Then, all connection weights below a threshold are set to zero. Finally, the pruned model obtained by the first two steps is fine‐tuned to recover its predictive accuracy. Experiments on MNIST and CIFAR‐10 datasets demonstrate that the proposed approach is effective and feasible. Compared with only weight pruning or filter pruning, the mixed pruning can achieve a higher compression ratio of the model parameters. For LeNet‐5, the proposed approach can achieve a compression rate of 13.01×, with a 1% drop in accuracy. For VGG‐16, it can achieve a compression rate of 19.20×, incurring 1.56% accuracy loss.
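
The two pruning steps described in this abstract can be sketched as follows. This is an illustration under assumptions, not the paper's code: the abstract does not name the filter importance criterion, so the L1 norm is assumed here, the function names are hypothetical, and both fine-tuning stages are omitted.

```python
def mixed_prune(filters, n_remove, threshold):
    """Apply filter pruning, then magnitude-based weight pruning.

    `filters` is a list of filters, each given as a flat list of weights.
    """
    # Step 1: drop the n_remove filters with the smallest L1 norm (assumed criterion).
    order = sorted(range(len(filters)), key=lambda i: sum(abs(w) for w in filters[i]))
    removed = set(order[:n_remove])
    kept = [f for i, f in enumerate(filters) if i not in removed]
    # Step 2: zero every remaining weight whose magnitude is below the threshold.
    return [[w if abs(w) >= threshold else 0.0 for w in f] for f in kept]


def compression_ratio(original, pruned):
    """Original parameter count divided by the number of surviving nonzero weights."""
    total = sum(len(f) for f in original)
    nonzero = sum(1 for f in pruned for w in f if w != 0.0)
    return float("inf") if nonzero == 0 else total / nonzero
```

Running the filter step first shrinks the set of weights that the threshold step has to examine, which is the ordering the abstract describes.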

  • Conference Article
  • Citations: 11
  • 10.1109/ictai.2019.00197
DASNet: Dynamic Activation Sparsity for Neural Network Efficiency Improvement
  • Nov 1, 2019
  • Qing Yang + 3 more

To improve the execution speed and efficiency of neural networks in embedded systems, it is crucial to decrease the model size and computational complexity. In addition to conventional compression techniques, e.g., weight pruning and quantization, removing unimportant activations can reduce the amount of data communication and the computation cost. Unlike weight parameters, the pattern of activations is directly related to input data and thereby changes dynamically. To regulate the dynamic activation sparsity (DAS), in this work, we propose a generic low-cost approach based on winners-take-all (WTA) dropout technique. The network enhanced by the proposed WTA dropout, namely DASNet, features structured activation sparsity with an improved sparsity level. Compared to the static feature map pruning methods, DASNets provide better computation cost reduction. The WTA technique can be easily applied in deep neural networks without incurring additional training variables. Our experiments on various networks and datasets present significant run-time speedups with negligible accuracy loss.

  • Research Article
  • 10.28991/hij-2025-06-03-014
Optimizing AIGC Technology for IoT Devices with Deep Learning
  • Nov 4, 2025
  • HighTech and Innovation Journal
  • Yushui Xiao + 1 more

This article explores how a deep learning model can improve the graphic-recognition ability of AI-generated content (AIGC) technology within the IoT ecosystem. Objectives: This research pursues two key objectives: first, to compress the model to a smaller size and lower computational cost for on-device deployment on resource-poor IoT devices, and second, to achieve better adaptability through data augmentation and regularization techniques. Methods/Analysis: A purpose-built CNN was designed and trained to address IoT-specific constraints. Model compression techniques such as weight pruning and quantization were used to reduce resource requirements. To avoid overfitting, we applied data augmentation techniques such as rotation, shear, and zoom, and regularization techniques such as dropout. The work was done on the standard MNIST and CIFAR-10 datasets using TensorFlow as the deep learning framework. Results: The pattern-recognition accuracies achieved on the MNIST and CIFAR-10 datasets are 99.5% and 89.2%, respectively. Moreover, recognition speed improved by around 30%, because parallel processing made the DL algorithm's computation more efficient and lowered processing time. The compressed model overcame the massive computational complexity and is therefore more suitable for resource-limited IoT devices. Novelty/Improvement: A new methodology is presented that integrates CNN optimization and model compression with sophisticated regularization techniques to develop a solution suited to the peculiarities of the IoT landscape. Ultimately, by overcoming universal problems such as limited resources and real-time processing, this research improves the technological and theoretical support for practical IoT applications and accelerates the practical implementation of AIGC performance optimization across industries such as smart homes, smart transportation, and smart security.

  • Conference Article
  • Citations: 6
  • 10.1109/asap49362.2020.00016
Array Aware Training/Pruning: Methods for Efficient Forward Propagation on Array-based Neural Network Accelerators
  • Jul 1, 2020
  • Krishna Teja Chitty-Venkata + 1 more

Due to the increase in the use of large-sized Deep Neural Networks (DNNs) over the years, specialized hardware accelerators such as Tensor Processing Unit and Eyeriss have been developed to accelerate the forward pass of the network. The essential component of these devices is an array processor, which is composed of multiple individual compute units for efficiently executing Multiplication and Accumulation (MAC) operations. As the size of this array limits the amount of DNN processing of a single layer, the computation is performed in several batches serially, leading to extra compute cycles along both axes. In practice, due to the mismatch between matrix and array sizes, the computation does not map onto the array exactly. In this work, we address the issue of minimizing processing cycles on the array by adjusting the DNN model parameters using a structured, hardware-array-dependent optimization. We introduce two techniques in this paper: Array Aware Training (AAT) for efficient training and Array Aware Pruning (AAP) for efficient inference. Weight pruning is an approach to remove redundant parameters in the network to decrease the size of the network. The key idea behind pruning in this paper is to adjust the model parameters (the weight matrix) so that the array is fully utilized in each computation batch. Our goal is to compress the model based on the size of the array so as to reduce the number of computation cycles. We observe that both proposed techniques result in similar accuracy to the original network while saving a significant number of processing cycles (75%).
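
The mismatch between matrix and array sizes described in this abstract can be made concrete with a tile count. The sketch assumes a square compute array and treats the number of array-sized tiles as a proxy for computation batches. Both helper names are hypothetical, and this is not the AAT/AAP algorithm itself.

```python
import math


def array_passes(rows, cols, array_size):
    """Number of array-sized tiles needed to cover a rows x cols weight matrix."""
    return math.ceil(rows / array_size) * math.ceil(cols / array_size)


def round_down_to_array(dim, array_size):
    """Largest multiple of array_size not exceeding dim, with a minimum of one tile."""
    return max(array_size, (dim // array_size) * array_size)
```

With a 128-wide array, a 300 x 300 layer needs 9 tiles, several of them mostly empty. Shrinking each dimension to 256 drops that to 4 fully used tiles, which is the kind of saving that motivates sizing the pruned matrix to the array.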
