Ax-BxP: Approximate Blocked Computation for Precision-reconfigurable Deep Neural Network Acceleration
Precision scaling has emerged as a popular technique to optimize the compute and storage requirements of Deep Neural Networks (DNNs). Efforts toward creating ultra-low-precision (sub-8-bit) DNNs for efficient inference suggest that the minimum precision required to achieve a given network-level accuracy varies considerably across networks, and even across layers within a network. This translates to a need to support variable precision computation in DNN hardware. Previous proposals for precision-reconfigurable hardware, such as bit-serial architectures, incur high overheads, significantly diminishing the benefits of lower precision. We propose Ax-BxP, a method for approximate blocked computation wherein each multiply-accumulate operation is performed block-wise (a block is a group of bits), facilitating re-configurability at the granularity of blocks. Further, approximations are introduced by only performing a subset of the required block-wise computations to realize precision re-configurability with high efficiency. We design a DNN accelerator that embodies approximate blocked computation and propose a method to determine a suitable approximation configuration for any given DNN. For the AlexNet, ResNet50, and MobileNetV2 DNNs, Ax-BxP achieves improvements in system energy and performance over an 8-bit fixed-point (FxP8) baseline, with minimal loss (<1% on average) in classification accuracy. Further, by varying the approximation configuration at a finer granularity across layers and data-structures within a DNN, we achieve additional improvements in system energy and performance.
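As a rough illustration of the blocked-computation idea (not the authors' implementation), the sketch below splits two 8-bit operands into 2-bit blocks and accumulates only a subset of the block-wise partial products; the block size and the keep-the-most-significant-pairs policy are assumptions made for the example.

```python
# Illustrative sketch of approximate blocked multiplication (not the Ax-BxP
# hardware): operands are split into fixed-width blocks and only a subset of
# the block-pair partial products is accumulated.

def to_blocks(x, block_bits=2, num_blocks=4):
    """Split an unsigned integer into blocks, least-significant block first."""
    mask = (1 << block_bits) - 1
    return [(x >> (i * block_bits)) & mask for i in range(num_blocks)]

def approx_block_mult(a, b, block_bits=2, num_blocks=4, kept_pairs=6):
    """Multiply a * b block-wise, keeping only the 'kept_pairs' partial
    products with the largest shift (i.e., the most significant ones)."""
    partials = []
    for i, ai in enumerate(to_blocks(a, block_bits, num_blocks)):
        for j, bj in enumerate(to_blocks(b, block_bits, num_blocks)):
            partials.append(((i + j) * block_bits, ai * bj))
    partials.sort(key=lambda p: p[0], reverse=True)   # hypothetical keep policy
    return sum(p << shift for shift, p in partials[:kept_pairs])

a, b = 173, 94                         # two unsigned 8-bit operands
print(a * b, approx_block_mult(a, b))  # exact product vs. approximate product
```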
- Conference Article
8
- 10.23919/date48585.2020.9116495
- Mar 1, 2020
Ternary Deep Neural Networks (DNNs), which employ ternary precision for weights and activations, have recently been shown to attain accuracies close to full-precision DNNs, raising interest in their efficient hardware realization. In this work we propose a Non-Volatile Ternary Compute-Enabled memory cell (TeC-Cell) based on ferroelectric transistors (FEFETs) for in-memory computing in the signed ternary regime. In particular, the proposed cell enables storage of ternary weights and employs multi-word-line assertion to perform massively parallel signed dot-product computations between ternary weights and ternary inputs. We evaluate the proposed design at the array level and show 72% and 74% higher energy efficiency for multiply-and-accumulate (MAC) operations compared to standard near-memory computing designs based on SRAM and FEFET, respectively. Furthermore, we evaluate the proposed TeC-Cell in an existing ternary in-memory DNN accelerator. Our results show 3.3X-3.4X reduction in system energy and 4.3X-7X improvement in system performance over SRAM and FEFET based near-memory accelerators, across a wide range of DNN benchmarks including both deep convolutional and recurrent neural networks.
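A functional sketch of the signed ternary dot product that such an array computes in parallel is shown below; the split into agreeing and disagreeing pairs is one common digital formulation and is an assumption here, not the TeC-Cell circuit.

```python
# Functional model of a signed ternary dot product over {-1, 0, +1}; a real
# TeC-Cell array evaluates this in the memory array itself, so this is only
# a digital reference for the arithmetic.

def ternary_dot(weights, activations):
    assert all(w in (-1, 0, 1) for w in weights)
    assert all(a in (-1, 0, 1) for a in activations)
    return sum(w * a for w, a in zip(weights, activations))

def ternary_dot_split(weights, activations):
    """Assumed alternative formulation: count agreeing and disagreeing
    nonzero weight/activation pairs separately and take the difference."""
    agree = sum(1 for w, a in zip(weights, activations) if w * a == 1)
    disagree = sum(1 for w, a in zip(weights, activations) if w * a == -1)
    return agree - disagree

w = [1, -1, 0, 1, -1]
x = [1, 1, -1, 0, -1]
print(ternary_dot(w, x), ternary_dot_split(w, x))   # both give the same result
```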
- Dissertation
- 10.17760/d20383685
- May 10, 2021
Driven by the rapid development of deep neural networks (DNNs) in recent years, artificial intelligence applications have been flourishing in a spectrum of fields, such as image classification, object detection, machine translation, speech recognition, and smart homes. However, the enormous number of weight parameters and computations of DNN models require resource-rich devices, resulting in tremendous power and energy consumption. The sizes of state-of-the-art DNN models are increasingly massive, which further impedes the deployment of DNNs in resource-constrained devices. This dissertation centers around addressing this challenge, and our efforts are classified into two directions: DNN model compression and hardware accelerator design. In pursuit of high-performance and energy-efficient DNN accelerators, it is desirable to investigate technologies beyond the conventional binary computing paradigm, and we consider stochastic computing (SC) a highly promising candidate. First, SC is a probabilistic computing paradigm, which uses a bit-sequence to represent a probability, obtained by counting the number of ones in the sequence. This feature makes it suitable for DNN inference, which is essentially an approximate computing application in which the final decision depends on the probabilities at the output layer. Second, SC is renowned as a footprint saver, since many complex arithmetic operations can be implemented with simple logic components; for example, multiplication can be conducted with AND gates in SC. Consequently, these two fascinating features of SC make it a favorable alternative to conventional binary computing. In this dissertation, we present SC-DCNN, the first SC-based DCNN inference accelerator. Specifically, (i) we propose the design of diverse function blocks for the basic operations in DCNNs; (ii) we propose novel feature extraction blocks (FEBs), which are intended for extracting features from input feature maps; (iii) we propose comprehensive techniques to reduce the area and power (energy) consumption of weight storage; and (iv) we propose holistic optimizations for the overall SC-DCNN architecture, with carefully selected layer-wise FEB configurations, to minimize area and power (energy) consumption while maintaining high network accuracy. Overall, the proposed SC-DCNN achieves the lowest hardware cost and energy consumption in implementing LeNet-5 compared with the state-of-the-art prior works. Besides emerging computing technologies, structured compression techniques, which aim to reduce the number of weights, and the corresponding accelerator designs also require extensive research. Therefore, we study block-circulant weight matrix (BCM)-based compression, which is well suited to this goal. BCM compression partitions the original weight matrix into blocks of square sub-matrices, and each sub-matrix is trained into a circulant matrix. In a circulant matrix, each row vector can be produced by shifting its prior row vector by one element; therefore, the whole matrix can be represented by only the first row vector, achieving significant storage and computation reduction. The effectiveness of the algorithm has been verified on multiple representative DNN models, including both DCNNs for image classification and LSTMs for speech recognition. Moreover, we propose an ASIC accelerator design using the compression method.
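A minimal sketch of the unipolar SC multiplication described above, assuming a fixed stream length and pseudo-random encoding (illustrative only, not the SC-DCNN design):

```python
import random

# Unipolar stochastic-computing multiplication: a value p in [0, 1] is encoded
# as a bit-stream whose fraction of ones is p, and the product of two
# independent streams is obtained with a bitwise AND. Stream length controls
# the accuracy of the result.

def encode(p, length=4096):
    return [1 if random.random() < p else 0 for _ in range(length)]

def decode(stream):
    return sum(stream) / len(stream)

def sc_multiply(p, q, length=4096):
    a, b = encode(p, length), encode(q, length)
    return decode([x & y for x, y in zip(a, b)])

print(sc_multiply(0.5, 0.8))   # approximately 0.4, up to stochastic error
```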
Experimental results show that the proposed BCM accelerator exhibits remarkable advantages in terms of power, throughput, and energy efficiency, indicating that this method is highly desirable for resource-constrained devices running DNNs. In order to further boost compression ratios and advance energy-efficient deep learning, we propose ADMM-NN, a model-agnostic and systematic compression framework unifying DNN pruning and quantization. In ADMM-NN, DNN compression is formulated as an optimization problem and solved using the alternating direction method of multipliers (ADMM). ADMM-NN first decomposes the optimization problem into two sub-problems. The first sub-problem is a neural network training problem with a regularization term, which regularizes the weight parameters to approach a specific compression pattern. The second sub-problem is to find a locally optimal compression pattern, which is then fed back to the first sub-problem. By iteratively solving these two relatively easy-to-solve sub-problems, a solution to the original problem can be found, and a high compression ratio can be obtained. Without accuracy loss, ADMM-NN achieved 85× and 24× pruning on the LeNet-5 and AlexNet models, respectively. Combining weight pruning and quantization, we achieved 1,910× and 231× reductions in overall model size on these two benchmarks. In addition, 26× and 17.4× weight pruning ratios were observed on VGG-16 and ResNet-50, respectively. Furthermore, we propose a hardware-aware compression framework. Specifically, we studied the relationship between pruning ratios and the speedups obtained when running a pruned model, and the discovered relationship curve was then integrated into the framework to guide the pruning strategy. By applying the hardware-aware framework, ASIC synthesis results showed a 3.6× overall speedup on the conv1-conv5 layers of AlexNet. Based on ADMM-NN, we further propose a structured pruning algorithm, for two reasons. First, ADMM-NN is a problem-solving framework that integrates neural network training with a compression algorithm, but it is not itself tied to any specific compression algorithm; therefore, an effective compression algorithm still needs to be studied. Second, conventional irregular pruning incurs high index and decoding overhead, so little acceleration can be achieved; therefore, a pruning algorithm that produces a regular matrix structure while remaining compatible with ADMM-NN is also imperative. Since irregular pruning has been empirically shown to achieve the highest compression ratios among a wide variety of compression techniques, it is natural to start by analyzing irregular pruning masks. We identified three characteristics of irregular pruning masks: (i) the number of retained weights in different rows varies significantly, and maintaining this variety helps sustain accuracy; (ii) denser rows are more sensitive to pruning than sparser rows; and (iii) a block-max weight masking method is proposed to effectively sustain the overall salience of the weight matrix while producing high regularity. By leveraging these characteristics, we propose density-adaptive regular block (DARB) pruning, which can simultaneously achieve high compression ratios and high hardware performance. DARB was evaluated on five models across three major application domains, and it generally outperforms the state-of-the-art prior work by 2.4× to 4.8×.
Besides, it achieves high decoding efficiency, which is defined as the number of activations selected for the corresponding retained weights per clock cycle. The hardware synthesis results showed that DARB outperforms block pruning with block size 4 × 4 and 8 × 8 by 14.3× and 3.6×, respectively. Meanwhile, DARB outperforms them on pruning ratios by 1.8× and 2.5×, respectively.
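For concreteness, here is a toy sketch of the two-sub-problem ADMM decomposition described above, applied to a least-squares problem with a k-sparse (pruning) constraint; the actual ADMM-NN framework trains a neural network in the first sub-problem, so the quadratic objective, the sparsity projection, and all constants below are illustrative assumptions.

```python
import numpy as np

# Toy version of the ADMM decomposition: sub-problem 1 is a "training" step
# (here, least squares) with a quadratic term pulling the weights toward the
# compression pattern; sub-problem 2 projects onto the pattern (keep the k
# largest-magnitude weights); the dual variable U feeds the pattern back.

def admm_prune(A, b, k, rho=1.0, iters=50):
    n = A.shape[1]
    W, Z, U = np.zeros(n), np.zeros(n), np.zeros(n)
    lhs = A.T @ A + rho * np.eye(n)                         # fixed system for W-update
    for _ in range(iters):
        W = np.linalg.solve(lhs, A.T @ b + rho * (Z - U))   # sub-problem 1
        Z = np.zeros(n)                                     # sub-problem 2:
        keep = np.argsort(np.abs(W + U))[-k:]               # keep k largest
        Z[keep] = (W + U)[keep]
        U = U + W - Z                                       # dual update
    return Z

A, b = np.random.randn(100, 20), np.random.randn(100)
print(np.count_nonzero(admm_prune(A, b, k=5)))   # 5 weights survive pruning
```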
- Conference Article
575
- 10.1109/isca.2018.00069
- Jun 1, 2018
Hardware acceleration of Deep Neural Networks (DNNs) aims to tame their enormous compute intensity. Fully realizing the potential of acceleration in this domain requires understanding and leveraging algorithmic properties of DNNs. This paper builds upon the algorithmic insight that bitwidth of operations in DNNs can be reduced without compromising their classification accuracy. However, to prevent loss of accuracy, the bitwidth varies significantly across DNNs and it may even be adjusted for each layer individually. Thus, a fixed-bitwidth accelerator would either offer limited benefits to accommodate the worst-case bitwidth requirements, or inevitably lead to a degradation in final accuracy. To alleviate these deficiencies, this work introduces dynamic bit-level fusion/decomposition as a new dimension in the design of DNN accelerators. We explore this dimension by designing Bit Fusion, a bit-flexible accelerator that comprises an array of bit-level processing elements that dynamically fuse to match the bitwidth of individual DNN layers. This flexibility in the architecture enables minimizing the computation and the communication at the finest granularity possible with no loss in accuracy. We evaluate the benefits of Bit Fusion using eight real-world feed-forward and recurrent DNNs. The proposed microarchitecture is implemented in Verilog and synthesized in 45 nm technology. Using the synthesis results and cycle-accurate simulation, we compare the benefits of Bit Fusion to two state-of-the-art DNN accelerators, Eyeriss [1] and Stripes [2]. In the same area, frequency, and process technology, Bit Fusion offers 3.9X speedup and 5.1X energy savings over Eyeriss. Compared to Stripes, Bit Fusion provides 2.6X speedup and 3.9X energy reduction at the 45 nm node when Bit Fusion area and frequency are set to those of Stripes. Scaling to the 16 nm GPU technology node, Bit Fusion almost matches the performance of a 250-Watt Titan Xp, which uses 8-bit vector instructions, while Bit Fusion merely consumes 895 milliwatts of power.
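The sketch below illustrates the arithmetic identity that bit-level fusion exploits: a wider product is assembled from 2-bit × 2-bit partial products, and the number of such primitives scales with the operand bitwidth. Unsigned operands and the 2-bit primitive width are assumptions for the example; the Bit Fusion hardware itself also handles signed data.

```python
# A fixed 2-bit x 2-bit multiply is used as the only primitive; the number of
# primitives fused (partial products shifted and summed) matches the operand
# bitwidth: 4 for 4-bit operands, 16 for 8-bit operands.

def split2(x, bitwidth):
    return [(x >> (2 * i)) & 0b11 for i in range(bitwidth // 2)]

def fused_mult(a, b, bitwidth):
    total, used_pes = 0, 0
    for i, ai in enumerate(split2(a, bitwidth)):
        for j, bj in enumerate(split2(b, bitwidth)):
            total += (ai * bj) << (2 * (i + j))   # one 2-bit primitive per pair
            used_pes += 1
    return total, used_pes

print(fused_mult(11, 13, bitwidth=4))    # (143, 4): four fused primitives
print(fused_mult(173, 94, bitwidth=8))   # (16262, 16): sixteen fused primitives
```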
- Research Article
4176
- 10.1109/jproc.2017.2761740
- Dec 1, 2017
- Proceedings of the IEEE
Deep neural networks (DNNs) are currently widely used for many artificial intelligence (AI) applications including computer vision, speech recognition, and robotics. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, techniques that enable efficient processing of DNNs to improve energy efficiency and throughput without sacrificing application accuracy or increasing hardware cost are critical to the wide deployment of DNNs in AI systems. This article aims to provide a comprehensive tutorial and survey about the recent advances toward the goal of enabling efficient processing of DNNs. Specifically, it will provide an overview of DNNs, discuss various hardware platforms and architectures that support DNNs, and highlight key trends in reducing the computation cost of DNNs either solely via hardware design changes or via joint hardware design and DNN algorithm changes. It will also summarize various development resources that enable researchers and practitioners to quickly get started in this field, and highlight important benchmarking metrics and design considerations that should be used for evaluating the rapidly growing number of DNN hardware designs, optionally including algorithmic codesigns, being proposed in academia and industry. The reader will take away the following concepts from this article: understand the key design considerations for DNNs; be able to evaluate different DNN hardware implementations with benchmarks and comparison metrics; understand the tradeoffs between various hardware architectures and platforms; be able to evaluate the utility of various DNN design techniques for efficient processing; and understand recent implementation trends and opportunities.
- Research Article
70
- 10.1016/j.micpro.2022.104441
- Jan 12, 2022
- Microprocessors and Microsystems
Review of ASIC accelerators for deep neural network
- Conference Article
28
- 10.1145/3243176.3243180
- Nov 1, 2018
With the proliferation of AI-based applications and services, there are strong demands for efficient processing of deep neural networks (DNNs). DNNs are known to be both compute- and memory-intensive as they require a tremendous amount of computation and large memory space. Quantization is a popular technique to boost the efficiency of DNNs by representing a number with fewer bits, hence reducing both computational strength and memory footprint. However, it is a difficult task to find an optimal number representation for a DNN due to a combinatorial explosion in feasible number representations with varying bit widths, which is only exacerbated by layer-wise optimization. Besides, existing quantization techniques often target a specific DNN framework and/or hardware platform, lacking portability across various execution environments. To address this, we propose libnumber, a portable, automatic quantization framework for DNNs. By introducing the Number abstract data type (ADT), libnumber encapsulates the internal representation of a number from the user. Then the auto-tuner of libnumber finds a compact representation (type, bit width, and bias) for the number that minimizes the user-supplied objective function, while satisfying the accuracy constraint. Thus, libnumber effectively separates the concern of developing an effective DNN model from low-level optimization of number representation. Our evaluation using eleven DNN models on two DNN frameworks targeting an FPGA platform demonstrates over 8× (7×) reduction in the parameter size on average when up to 7% (1%) loss of relative accuracy is tolerable, with a maximum reduction of 16×, compared to the baseline using 32-bit floating-point numbers. This leads to a geomean speedup of 3.79× with a maximum speedup of 12.77× over the baseline, while requiring only minimal programmer effort.
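To make the tuning idea concrete, here is a hypothetical per-tensor bit-width search in the spirit of such an auto-tuner; the linear symmetric quantizer, the error tolerance, and all function names are assumptions and not the libnumber API.

```python
import numpy as np

# Hypothetical per-tensor bit-width search: pick the smallest width whose
# quantization error stays below a tolerance, falling back to full precision.

def quantize_linear(x, bits):
    scale = (np.max(np.abs(x)) + 1e-12) / (2 ** (bits - 1) - 1)
    q = np.clip(np.round(x / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q * scale

def pick_bitwidth(x, tolerance=1e-2, candidates=(4, 5, 6, 7, 8, 16)):
    for bits in candidates:
        if np.mean(np.abs(quantize_linear(x, bits) - x)) <= tolerance:
            return bits
    return 32                      # no candidate met the accuracy constraint

weights = np.random.randn(10000).astype(np.float32)
print(pick_bitwidth(weights))      # smallest width meeting the tolerance
```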
- Research Article
28
- 10.1147/jrd.2019.2947011
- Nov 1, 2019
- IBM Journal of Research and Development
Deep neural networks (DNNs) achieve best-known accuracies in many machine learning tasks involved in image, voice, and natural language processing and are being used in an ever-increasing range of applications. However, their algorithmic benefits are accompanied by extremely high computation and storage costs, sparking intense efforts in optimizing the design of computing platforms for DNNs. Today, graphics processing units (GPUs) and specialized digital CMOS accelerators represent the state-of-the-art in DNN hardware, with near-term efforts focusing on approximate computing through reduced precision. However, the ever-increasing complexities of DNNs and the data they process have fueled an active interest in alternative hardware fabrics that can deliver the next leap in efficiency. Resistive crossbars designed using emerging nonvolatile memory technologies have emerged as a promising candidate building block for future DNN hardware fabrics since they can natively execute massively parallel vector-matrix multiplications (the dominant compute kernel in DNNs) in the analog domain within the memory arrays. Leveraging in-memory computing and dense storage, resistive-crossbar-based systems cater to both the high computation and storage demands of complex DNNs and promise energy efficiency beyond current DNN accelerators by mitigating data transfer and memory bottlenecks. However, several design challenges need to be addressed to enable their adoption. For example, the overheads of peripheral circuits (analog-to-digital converters and digital-to-analog converters) and other components (scratchpad memories and on-chip interconnect) may significantly diminish the efficiency benefits at the system level. Additionally, the analog crossbar computations are intrinsically subject to noise due to a range of device- and circuit-level nonidealities, potentially leading to lower accuracy at the application level. In this article, we highlight the prospects for designing hardware accelerators for neural networks using resistive crossbars. We also underscore the key open challenges and some possible approaches to address them.
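A behavioral sketch of the crossbar vector-matrix multiplication described above is given below, with weights mapped to a differential pair of conductances and Gaussian conductance noise standing in for one class of device non-idealities; the conductance range, read voltage, and noise model are illustrative assumptions.

```python
import numpy as np

G_MIN, G_MAX = 1e-6, 1e-4   # illustrative conductance range (siemens)

def to_conductance(W):
    """Map weights in [-1, 1] onto a differential pair of conductances."""
    Wn = np.clip(W, -1.0, 1.0)
    return (G_MIN + (G_MAX - G_MIN) * np.maximum(Wn, 0),
            G_MIN + (G_MAX - G_MIN) * np.maximum(-Wn, 0))

def crossbar_matvec(W, x, v_read=0.2, noise_sigma=0.02):
    """Ideal behavior: column currents realize W.T @ x (Ohm + Kirchhoff),
    perturbed here by multiplicative conductance noise."""
    g_pos, g_neg = to_conductance(W)
    noisy = lambda g: g * (1 + noise_sigma * np.random.randn(*g.shape))
    v = v_read * x                                  # input voltages encode activations
    i_diff = (noisy(g_pos) - noisy(g_neg)).T @ v    # per-column current difference
    return i_diff / (v_read * (G_MAX - G_MIN))      # rescale back to weight units

W = np.random.uniform(-1, 1, size=(8, 4))
x = np.random.uniform(0, 1, size=8)
print(crossbar_matvec(W, x))
print(W.T @ x)                                      # ideal result for comparison
```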
- Research Article
7
- 10.1007/s00521-021-06113-4
- Jan 1, 2021
- Neural Computing & Applications
Deep neural networks (DNNs) have demonstrated superior performance in most learning tasks. However, a DNN typically contains a large number of parameters and operations, requiring a high-end processing platform for high-speed execution. To address this challenge, hardware-and-software co-design strategies, which involve joint DNN optimization and hardware implementation, can be applied. These strategies reduce the parameters and operations of the DNN, and fit it into a low-resource processing platform. In this paper, a DNN model is used for the analysis of the data captured using an electrochemical method to determine the concentration of a neurotransmitter and the recording electrode. Next, a DNN miniaturization algorithm is introduced, involving combined pruning and compression, to reduce the DNN resource utilization. Here, the DNN is transformed to have sparse parameters by pruning a percentage of its weights. The Lempel–Ziv–Welch algorithm is then applied to compress the sparse DNN. Next, a DNN overlay is developed, combining the decompression of the DNN parameters and DNN inference, to allow the execution of the DNN on an FPGA on the PYNQ-Z2 board. This approach helps avoid the need to include a complex quantization algorithm. It compresses the DNN by a factor of 6.18, leading to about 50% reduction in the resource utilization on the FPGA.
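A rough sketch of the prune-then-compress flow is shown below: magnitude pruning sparsifies the weights and a minimal LZW encoder then exploits the resulting zero runs. The int8 serialization, pruning ratio, and encoder details are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

# Magnitude pruning sparsifies the weights; a minimal LZW encoder then
# exploits the resulting zero runs. Details (int8 serialization, 70% sparsity)
# are illustrative only.

def prune_by_magnitude(w, sparsity=0.7):
    threshold = np.sort(np.abs(w), axis=None)[int(sparsity * w.size)]
    return np.where(np.abs(w) < threshold, 0.0, w)

def lzw_encode(data: bytes):
    """Minimal LZW encoder returning a list of output codes."""
    table = {bytes([i]): i for i in range(256)}
    prefix, codes, next_code = b"", [], 256
    for byte in data:
        candidate = prefix + bytes([byte])
        if candidate in table:
            prefix = candidate
        else:
            codes.append(table[prefix])
            table[candidate] = next_code
            next_code += 1
            prefix = bytes([byte])
    if prefix:
        codes.append(table[prefix])
    return codes

weights = np.random.randn(64, 64).astype(np.float32)
pruned = prune_by_magnitude(weights)
as_int8 = np.round(pruned * 127 / np.max(np.abs(pruned))).astype(np.int8)
codes = lzw_encode(as_int8.tobytes())
print(len(as_int8.tobytes()), "bytes ->", len(codes), "LZW codes")
```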
- Conference Article
1
- 10.1109/ispa-bdcloud-socialcom-sustaincom57177.2022.00033
- Dec 1, 2022
The idea of using inexact computation for overprovisioned DNNs (Deep Neural Networks) to decrease power and latency at the cost of minor accuracy degradation has become very popular. However, there is still no general method to schedule DNN computations on a given hardware platform to effectively implement this idea without loss in computational efficiency. Most contemporary methods require specialized hardware, extensive retraining and hardware-specific scheduling schemes. We present FAWS: Fault-Aware Weight Scheduler for scheduling DNN computations in heterogeneous and faulty hardware. Given a trained DNN model and a hardware fault profile, our scheduler is able to recover significant accuracy during inference even at high fault rates. FAWS schedules the computations such that the low priority ones are allocated to inexact hardware. This is achieved by shuffling (exchanging) the rows of the matrices. The best shuffling order for a given DNN model and hardware fault profile is determined using Genetic Algorithms (GA). We simulate bitwise errors on different model architectures and datasets with different types of fault profiles and observe that FAWS can recover up to 30% of classification accuracy even at high fault rates (which correspond to approximately 50% power savings).
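The sketch below illustrates the scheduling idea in simplified form: rows of a weight matrix are permuted so that low-salience rows land on the faultiest units, and a small mutation-only search stands in for the genetic algorithm used in FAWS; the salience metric and fault model are assumptions.

```python
import numpy as np

# Rows of a weight matrix are permuted so that low-salience rows land on the
# faultiest compute units. A mutation-only search stands in for the GA used
# in FAWS; salience and fault metrics are illustrative.

def row_salience(W):
    return np.sum(np.abs(W), axis=1)            # proxy for each row's importance

def expected_damage(order, salience, unit_fault_rate):
    # Row order[i] runs on unit i; damage = salience exposed to faulty units.
    return float(np.sum(salience[order] * unit_fault_rate))

def search_order(W, unit_fault_rate, iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    salience = row_salience(W)
    best = np.arange(W.shape[0])
    best_cost = expected_damage(best, salience, unit_fault_rate)
    for _ in range(iters):                       # swap two rows per step
        cand = best.copy()
        i, j = rng.integers(0, len(cand), size=2)
        cand[i], cand[j] = cand[j], cand[i]
        cost = expected_damage(cand, salience, unit_fault_rate)
        if cost < best_cost:
            best, best_cost = cand, cost
    return best, best_cost

W = np.random.randn(16, 32)
fault_rate = np.random.uniform(0, 0.3, size=16)  # per-unit error-rate proxy
print(search_order(W, fault_rate))
```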
- Research Article
1
- 10.1149/ma2024-01211293mtgabs
- Aug 9, 2024
- Electrochemical Society Meeting Abstracts
Emerging non-volatile memory (NVM) devices, such as STT-MRAM, PCM, and RRAM, have been explored for embedded memory and storage applications to replace CMOS-based SRAM/DRAM and Flash devices. Recently, many of these memory devices have been utilized for new computing paradigms beyond Boolean logic and von Neumann architectures. For example, in-memory analog computing reduces data movement between computing and memory units and exploits the intrinsic parallelism in memory arrays. It finds a natural application in deep neural network (DNN) accelerators by implementing high-throughput, high-efficiency multiply-accumulate (MAC) operations. Here the conductance of memory devices in a crossbar array represents DNN weights, and the activations are encoded in input electrical signals (e.g., pulse height or duration). The MAC operation is conducted via Ohm's law (multiplication of voltage and conductance) and Kirchhoff's law (accumulation via current summation) in constant time, even for very large networks. DNNs have surpassed human performance in various AI applications, e.g., image classification, natural language processing, etc. While general-purpose CPUs/GPUs and special-purpose digital accelerators provide current and near-term DNN hardware, there are longer-term opportunities for analog DNN accelerators based on emerging memory devices to achieve significantly higher performance and energy efficiency. At the same time, analog accelerators impose new requirements on these devices beyond traditional memory applications, e.g., analog tunability, gradual and symmetric weight modulation, high precision, etc. Memory devices with an analog nature in their physical mechanisms (e.g., filament growth in RRAM) may be optimized to meet these requirements, while some abrupt and asymmetric characteristics (e.g., filament rupture) present challenges. Increasingly large neural network models have been demonstrated on these memory arrays designed as analog accelerators, but they are still orders of magnitude smaller than state-of-the-art DNN models. While analog accelerators enable massively parallel computation, they are also susceptible to unique challenges in analog devices and circuitry (e.g., device variability, circuit noise), which may degrade network performance (e.g., accuracy). To benefit from the massively parallel MAC operation in analog memory arrays, these arrays need to be large enough to efficiently map the layers in modern DNN models. Among emerging NVM devices, PCM has the advantages of maturity and the availability of large-scale arrays, but it also faces some challenges in device characteristics, e.g., conductance drift, asymmetry, and noise. PCM-based analog DNN accelerators have been demonstrated at advanced technology nodes with millions of devices and have achieved iso-accuracy on increasingly large network models. These accelerators integrate highly efficient analog PCM tiles for MAC operations with advanced CMOS circuitry for auxiliary digital functions. While material/device engineering continues to be explored to improve the analog properties of PCM devices, design and operation innovations can also help to improve the performance of PCM-based DNN weights, e.g., multiple-device-per-weight design and closed-loop tuning. In addition, circuit innovations are essential for analog accelerator performance. Fig. 1 shows a 14nm PCM-based DNN inference accelerator, which incorporates design techniques such as 4-PCM weight units, a 2D mesh for tile-to-tile communication, pulse-duration-based coding, etc.
On top of technology and design innovations, some DNN models can also be modified to be more resilient against hardware imperfection and noise. PCM-based analog accelerators have achieved iso-accuracy on large DNN models with millions of weights. This talk will discuss the progress that we have achieved on PCM-based analog DNN inference accelerators, the challenges of PCM materials and devices, and promising solutions in technology and design.
- Conference Article
- 10.1109/cleoe-eqec.2019.8872657
- Jun 1, 2019
Recent years have seen marked developments in deep neural networks (DNNs) stemming from advances in hardware and increasingly large datasets. DNNs are now routinely used in domains including computer vision and language processing. At their core, DNNs rely heavily on multiply-accumulate (MAC) operations, making them well-suited for the highly parallel computational abilities of GPUs. GPUs, however, are von Neumann in architecture and physically separate memory blocks from computational blocks. This exacts an unavoidable time and energy cost associated with data transport known as the von Neumann bottleneck. While incremental advances in digital hardware accelerators mitigating the von Neumann bottleneck will continue, we explore the potentially game-changing advantages of non-von Neumann architectures that perform MAC operations within the memory. This is achieved using a crossbar array of analog memory as shown in Fig. 1, which serves as the basis of our analog DNN hardware accelerators, and is amenable to both DNN training and forward inference [1], [2]. Recent work from our group has shown analog DNN hardware accelerators capable of a 280× speedup in per-area throughput while also providing a 100× increase in energy efficiency over state-of-the-art GPUs [3].
- Conference Article
88
- 10.1145/3373087.3375306
- Feb 23, 2020
Recent breakthroughs in Deep Neural Networks (DNNs) have fueled a growing demand for DNN chips. However, designing DNN chips is non-trivial because: (1) mainstream DNNs have millions of parameters and operations; (2) the large design space due to the numerous design choices of dataflows, processing elements, memory hierarchy, etc.; and (3) an algorithm/hardware co-design is needed to allow the same DNN functionality to have a different decomposition, which would require different hardware IPs to meet the application specifications. Therefore, DNN chips take a long time to design and require cross-disciplinary experts. To enable fast and effective DNN chip design, we propose AutoDNNchip - a DNN chip generator that can automatically generate both FPGA- and ASIC-based DNN chip implementations given DNNs from machine learning frameworks (e.g., PyTorch) for a designated application and dataset. Specifically, AutoDNNchip consists of two integrated enablers: (1) a Chip Predictor, built on top of a graph-based accelerator representation, which can accurately and efficiently predict a DNN accelerator's energy, throughput, and area based on the DNN model parameters, hardware configuration, technology-based IPs, and platform constraints; and (2) a Chip Builder, which can automatically explore the design space of DNN chips (including IP selection, block configuration, resource balancing, etc.), optimize chip design via the Chip Predictor, and then generate optimized synthesizable RTL to achieve the target design metrics. Experimental results show that our Chip Predictor's predicted performance differs from real-measured ones by < 10% when validated using 15 DNN models and 4 platforms (edge-FPGA/TPU/GPU and ASIC). Furthermore, accelerators generated by our AutoDNNchip can achieve better (up to 3.86X improvement) performance than that of expert-crafted state-of-the-art accelerators.
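As a toy illustration of the kind of analytical estimation a chip predictor performs, the sketch below derives per-layer latency and energy from MAC and byte counts and a hardware configuration using a roofline-style model; the model form and all constants are assumptions, not AutoDNNchip's Chip Predictor.

```python
# Roofline-style per-layer estimate: latency is bounded by the slower of
# compute and memory, energy is a weighted sum of MACs and bytes moved.
# All constants are illustrative assumptions.

def predict_layer(macs, bytes_moved, num_pes, freq_hz, dram_bw_bytes_s,
                  energy_per_mac_j=1e-12, energy_per_byte_j=20e-12):
    compute_s = macs / (num_pes * freq_hz)        # all PEs busy, 1 MAC/cycle each
    memory_s = bytes_moved / dram_bw_bytes_s      # off-chip traffic bound
    latency_s = max(compute_s, memory_s)          # slower side dominates
    energy_j = macs * energy_per_mac_j + bytes_moved * energy_per_byte_j
    return latency_s, energy_j

# Example: a 3x3 conv, 64 -> 128 channels, 56x56 output, 16-bit data (rough counts)
macs = 3 * 3 * 64 * 128 * 56 * 56
bytes_moved = 2 * (3 * 3 * 64 * 128 + 64 * 58 * 58 + 128 * 56 * 56)
print(predict_layer(macs, bytes_moved, num_pes=1024,
                    freq_hz=500e6, dram_bw_bytes_s=25e9))
```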
- Book Chapter
- 10.1007/978-3-031-22039-5_1
- Jan 1, 2022
The ever-increasing complexity of both Deep Neural Networks (DNNs) and hardware accelerators has made the co-optimization of these domains extremely complex. Previous works typically focus on optimizing DNNs given a fixed hardware configuration or optimizing a specific hardware architecture given a fixed DNN model. Recently, the joint exploration of the two spaces has drawn more and more attention. Our work targets the co-optimization of DNN and hardware configurations on an edge GPU accelerator. We investigate the importance of the joint exploration of DNN and edge GPU configurations. We propose an evolutionary-based co-optimization strategy for DNNs that considers three metrics: DNN accuracy, execution latency, and power consumption. By combining the two search spaces, we have observed that we can explore more solutions and obtain a better tradeoff between DNN accuracy and hardware efficiency. Experimental results show that the co-optimization outperforms the optimization of the DNN for a fixed hardware configuration, with up to 53% hardware efficiency gains for the same accuracy and latency.
- Conference Article
29
- 10.1109/cfis.2018.8336623
- Feb 1, 2018
In recent years, the applications of deep neural networks have been increasing rapidly. There are two important factors determining the efficiency of training a computer vision system using deep neural networks. The first factor is the difficulty of training a very deep neural network with a large number of parameters. The second factor is the efficiency of the trained network in decreasing the computational cost. In this paper, an efficient deep neural network that uses grid size reduction, factorization, and hyperparameter tuning is proposed. In order to deal with the large number of layers, residual units are used. A series of experimental simulations are performed on the application of the proposed deep neural network to the classification of aerial images. The experimental results show that the proposed architecture has acceptable accuracy for aerial scene classification.
- Conference Article
8
- 10.1109/iccd53106.2021.00088
- Oct 1, 2021
Deep neural networks (DNNs) have achieved remarkable success in many fields. Large-scale DNNs also bring storage challenges when storing snapshots to protect against clusters’ frequent failures, and generate massive internet traffic when dispatching or updating DNNs for resource-constrained devices (e.g., IoT devices, mobile phones). Several approaches aim to compress DNNs. The recent work Delta-DNN observes the high similarity between DNN versions and thus calculates differences between them to improve the compression ratio. However, we observe that Delta-DNN, which applies a traditional global lossy quantization technique when calculating the differences between two neighboring versions of a DNN, cannot fully exploit the data similarity between them for delta compression. This is because the parameters’ value ranges (and also the delta data in Delta-DNN) vary among layers in DNNs, which inspires us to propose a local-sensitive quantization scheme: the quantizers are adaptive to parameters’ local value ranges in layers. Moreover, instead of quantizing the differences of DNNs as in Delta-DNN, our approach quantizes the DNNs before calculating differences, making the differences more compressible. Besides, we also propose an error feedback mechanism to reduce DNNs’ accuracy loss caused by the lossy quantization. Therefore, we design a novel quantization-based delta compressor called QD-Compressor, which calculates the lossy differences between epochs of DNNs to save the storage cost of backing up DNNs’ snapshots and the internet traffic of dispatching DNNs to resource-constrained devices. Experiments on several popular DNNs and datasets show that QD-Compressor obtains a compression ratio 2.4× ~ 31.5× higher than the state-of-the-art approaches while well maintaining the model’s test accuracy.
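A minimal sketch of the quantize-then-delta idea is shown below, with a per-layer (local) scale, differencing of the quantized snapshots, and a generic entropy coder (zlib) standing in for the actual coder; all details are illustrative assumptions rather than the QD-Compressor design.

```python
import zlib
import numpy as np

# Each layer is quantized with its own (local) scale, the two quantized
# snapshots are differenced, and the low-entropy delta is compressed.
# zlib stands in for whatever coder a real system would use.

def quantize_layer(w, bits=8):
    scale = (np.max(np.abs(w)) + 1e-12) / (2 ** (bits - 1) - 1)  # per-layer scale
    return np.round(w / scale).astype(np.int16)

def delta_compress(prev_layers, curr_layers):
    blobs = []
    for prev, curr in zip(prev_layers, curr_layers):
        delta = quantize_layer(curr) - quantize_layer(prev)      # mostly near zero
        blobs.append(zlib.compress(delta.tobytes()))
    return blobs

prev = [np.random.randn(256, 256) for _ in range(3)]             # snapshot at epoch t
curr = [w + 0.01 * np.random.randn(*w.shape) for w in prev]      # snapshot at epoch t+1
blobs = delta_compress(prev, curr)
print(sum(len(b) for b in blobs), "compressed bytes vs.",
      sum(w.size * 2 for w in curr), "raw int16 bytes")
```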