Evolutionary-Based Co-optimization of DNN and Hardware Configurations on Edge GPU
Abstract: The ever-increasing complexity of both Deep Neural Networks (DNN) and hardware accelerators has made the co-optimization of these domains extremely complex. Previous works typically focus on optimizing DNNs given a fixed hardware configuration or optimizing a specific hardware architecture given a fixed DNN model. Recently, the joint exploration of the two spaces has drawn increasing attention. Our work targets the co-optimization of DNN and hardware configurations on edge GPU accelerators. We investigate the importance of the joint exploration of DNN and edge GPU configurations. We propose an evolutionary-based co-optimization strategy for DNNs that considers three metrics: DNN accuracy, execution latency, and power consumption. By combining the two search spaces, we observe that more solutions can be explored and a better tradeoff between DNN accuracy and hardware efficiency can be obtained. Experimental results show that the co-optimization outperforms the optimization of a DNN for a fixed hardware configuration, with up to 53% hardware efficiency gains for the same accuracy and latency.
- Conference Article
2
- 10.1109/dsd57027.2022.00060
- Aug 1, 2022
The ever-increasing complexity of both Deep Neural Networks (DNN) and hardware accelerators has made the co-optimization of these domains extremely complex. Previous works typically focus on optimizing DNNs given a fixed hardware configuration or optimizing a specific hardware architecture given a fixed DNN model. Recently, the joint exploration of the two spaces has drawn increasing attention. Our work targets the co-optimization of DNN and hardware configurations on edge GPU accelerators. We propose an evolutionary-based co-optimization strategy that considers three metrics: DNN accuracy, execution latency, and power consumption. By combining the two search spaces, a larger number of configurations can be explored in a short time interval. In addition, a better tradeoff between DNN accuracy and hardware efficiency can be obtained. Experimental results show that the co-optimization outperforms the optimization of a DNN for a fixed hardware configuration, with up to 53% hardware efficiency gains at the same accuracy and inference time.
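The joint search described in this abstract can be illustrated with a small multi-objective evolutionary loop. The sketch below is not the authors' algorithm: the search space `SPACE`, the cost model in `evaluate`, and the selection scheme (keep the non-dominated set each generation) are all assumptions chosen for illustration. A real study would train each DNN and measure latency and power on the target edge GPU.

```python
import random

# Hypothetical joint search space: DNN knobs plus edge-GPU knobs.
SPACE = {
    "depth": [8, 14, 20],
    "width": [16, 32, 64],
    "gpu_freq_mhz": [300, 600, 900],
    "batch": [1, 4, 8],
}

def random_config(rng):
    return {key: rng.choice(values) for key, values in SPACE.items()}

def mutate(cfg, rng):
    """Resample one randomly chosen knob (DNN or hardware)."""
    child = dict(cfg)
    key = rng.choice(list(SPACE))
    child[key] = rng.choice(SPACE[key])
    return child

def evaluate(cfg):
    """Toy stand-in for training and on-device measurement.

    Returns (error, latency, power); all three are minimized.
    """
    work = cfg["depth"] * cfg["width"]
    error = 1.0 / (1.0 + work / 100.0)
    latency = work / cfg["gpu_freq_mhz"]
    power = cfg["gpu_freq_mhz"] / 100.0 + 0.1 * cfg["batch"]
    return (error, latency, power)

def dominates(a, b):
    """True if a is no worse than b everywhere and better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(scored):
    """Keep the (config, score) pairs that no other pair dominates."""
    return [(c, s) for c, s in scored
            if not any(dominates(t, s) for _, t in scored)]

def evolve(generations=20, pop_size=16, seed=0):
    rng = random.Random(seed)
    pop = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        children = [mutate(rng.choice(pop), rng) for _ in range(pop_size)]
        # Drop duplicate configurations before scoring.
        unique = {tuple(sorted(c.items())): c for c in pop + children}
        scored = [(c, evaluate(c)) for c in unique.values()]
        # Survivors are the non-dominated configs (may be fewer than pop_size).
        pop = [c for c, _ in pareto_front(scored)][:pop_size]
    return [(c, evaluate(c)) for c in pop]
```

Calling `evolve()` returns a list of `(config, (error, latency, power))` pairs. Because hardware knobs such as `gpu_freq_mhz` are mutated alongside DNN knobs, the surviving set can contain trade-offs that a search with fixed hardware settings would never visit.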
- Research Article
15
- 10.1145/3270689
- Sep 30, 2018
- ACM Transactions on Reconfigurable Technology and Systems
Hardware implementations of deep neural networks (DNNs) have been adopted in many systems because of their higher classification speed. However, while they may be characterized by better accuracy, larger DNNs require significant energy and area, thereby limiting their wide adoption. The energy consumption of DNNs is driven by both memory accesses and computation. Binarized neural networks (BNNs), as a tradeoff between accuracy and energy consumption, can achieve great energy reduction and have good accuracy for large DNNs due to their regularization effect. However, BNNs show poor accuracy when a smaller DNN configuration is adopted. In this article, we propose a new DNN architecture, LightNN, which replaces multiplications with one shift or a constrained number of shifts and adds. Our theoretical analysis for LightNNs shows that their accuracy is maintained while dramatically reducing storage and energy requirements. For a fixed DNN configuration, LightNNs have better accuracy than BNNs at a slight energy increase, yet are more energy efficient than conventional DNNs with only slightly less accuracy. Therefore, LightNNs provide more options for hardware designers to trade off accuracy and energy. Moreover, for large DNN configurations, LightNNs have a regularization effect, making them better in accuracy than conventional DNNs. These conclusions are verified by experiments using the MNIST and CIFAR-10 datasets for different DNN configurations. Our FPGA implementation for conventional DNNs and LightNNs confirms all theoretical and simulation results and shows that LightNNs reduce latency and use fewer FPGA resources compared to conventional DNN architectures.
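The idea of replacing a multiplication with "one shift or a constrained number of shifts and adds" can be sketched as follows. This is an illustrative reading, not the LightNN implementation: rounding in the log domain and the greedy residual scheme in `quantize_shifts` are assumptions, and all function names are hypothetical.

```python
import math

def nearest_power_of_two(w):
    """Return the signed power of two closest to w in log scale (0.0 stays 0.0)."""
    if w == 0.0:
        return 0.0
    exponent = round(math.log2(abs(w)))
    return math.copysign(2.0 ** exponent, w)

def quantize_shifts(w, k=1):
    """Greedily approximate w by a sum of at most k signed powers of two."""
    terms = []
    residual = w
    for _ in range(k):
        if residual == 0.0:
            break
        term = nearest_power_of_two(residual)
        terms.append(term)
        residual -= term
    return terms

def shift_multiply(x, terms):
    """Multiply integer x by sum(terms) using only shifts and adds.

    Right shifts floor the result, so negative exponents lose low-order bits.
    """
    acc = 0
    for term in terms:
        exponent = round(math.log2(abs(term)))
        shifted = x << exponent if exponent >= 0 else x >> -exponent
        acc += shifted if term > 0 else -shifted
    return acc
```

For example, `quantize_shifts(0.75, k=2)` yields `[1.0, -0.25]`, so multiplying the integer activation 8 by 0.75 becomes `(8 << 0) - (8 >> 2)`, which `shift_multiply(8, [1.0, -0.25])` evaluates to 6. With `k=1` the same weight rounds to 1.0, which shows the accuracy cost of allowing only a single shift.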
- Conference Article
40
- 10.1145/3060403.3060465
- May 10, 2017
Application-specific integrated circuit (ASIC) implementations for Deep Neural Networks (DNNs) have been adopted in many systems because of their higher classification speed. However, although they may be characterized by better accuracy, larger DNNs require significant energy and area, thereby limiting their wide adoption. The energy consumption of DNNs is driven by both memory accesses and computation. Binarized Neural Networks (BNNs), as a trade-off between accuracy and energy consumption, can achieve great energy reduction and have good accuracy for large DNNs due to their regularization effect. However, BNNs show poor accuracy when a smaller DNN configuration is adopted. In this paper, we propose a new DNN model, LightNN, which replaces multiplications with one shift or a constrained number of shifts and adds. For a fixed DNN configuration, LightNNs have better accuracy than BNNs at a slight energy increase, yet are more energy efficient than conventional DNNs with only slightly less accuracy. Therefore, LightNNs provide more options for hardware designers to make trade-offs between accuracy and energy. Moreover, for large DNN configurations, LightNNs have a regularization effect, making them better in accuracy than conventional DNNs. These conclusions are verified by experiments using the MNIST and CIFAR-10 datasets for different DNN configurations.
- Conference Article
31
- 10.1109/iccad45719.2019.8942046
- Nov 1, 2019
The ever-increasing computational cost of Deep Neural Networks (DNN) and the demand for energy-efficient hardware for DNN acceleration have made accuracy and hardware cost co-optimization for DNNs tremendously important, especially for edge devices. Owing to the large parameter space and the cost of evaluating each parameter in the search space, manual tuning of DNN hyperparameters is impractical. Automatic joint DNN and hardware hyperparameter optimization is indispensable for such problems. Bayesian optimization-based approaches have shown promising results for hyperparameter optimization of DNNs. However, most of these techniques have been developed without considering the underlying hardware, thereby leading to inefficient designs. Further, the few works that perform joint optimization are not generalizable and mainly focus on CMOS-based architectures. In this work, we present a novel pseudo agent-based multiobjective hyperparameter optimization (PABO) for maximizing DNN performance while obtaining low hardware cost. Compared to the existing methods, our work takes a theoretically different approach to the joint optimization of accuracy and hardware cost and focuses on memristive crossbar based accelerators. PABO uses a supervisor agent to establish connections between the posterior Gaussian distribution models of network accuracy and hardware cost requirements. The agent reduces the mathematical complexity of the co-optimization problem by removing unnecessary computations and updates of acquisition functions, thereby achieving significant speed-ups for the optimization procedure. PABO outputs a Pareto frontier that underscores the trade-offs between high accuracy and hardware efficiency. Our results demonstrate superior performance compared to the state-of-the-art methods both in terms of accuracy and computational speed (~100x speed up).
- Research Article
172
- 10.1109/tai.2021.3067574
- May 4, 2021
- IEEE Transactions on Artificial Intelligence
A variety of methods have been applied to the architectural configuration and learning or training of artificial deep neural networks (DNN). These methods play a crucial role in the success or failure of the DNN for most problems and applications. Evolutionary algorithms (EAs) are gaining momentum as a computationally feasible method for the automated optimization of DNNs. Neuroevolution is the term that describes these processes of automated configuration and training of DNNs using EAs. While many works exist in the literature, no comprehensive surveys currently exist focusing exclusively on the strengths and limitations of using neuroevolution approaches in DNNs. The absence of such surveys can lead to a disjointed and fragmented field, preventing DNN researchers from adopting neuroevolutionary methods in their own research and resulting in lost opportunities for wider application within real-world deep learning problems. This article presents a comprehensive survey, discussion, and evaluation of the state-of-the-art in using EAs for architectural configuration and training of DNNs. This article highlights the most pertinent current issues and challenges in neuroevolution and identifies multiple promising future research directions.
Impact Statement: The concept of deep learning originated from the study of artificial neural networks (ANNs). ANNs have achieved extraordinary results in a variety of diverse application areas. Numerous methods have been applied to the architectural configuration and learning or training of artificial DNNs, and these methods play a crucial role in the success or failure of the DNN for most problems and applications. Recently, EAs have been gaining momentum as a computationally feasible method (called neuroevolution) for the automated configuration and learning or training of DNNs. This article reviews over 170 recent scientific papers describing how major EA paradigms are being applied by researchers to the configuration and optimization of multiple DNNs. By articulating a clear understanding of the context, state-of-the-art, and feasibility of neuroevolution, this article will benefit researchers in AI, EAs, and DNNs. The impact of this article comes from contributing toward enhancing research capacity, knowledge, and skills for researchers currently working in neuroevolution and actively engaging those considering becoming involved in this area.
- Research Article
2
- 10.1007/s00521-024-09719-6
- May 2, 2024
- Neural Computing and Applications
Deep neural networks (DNNs) have been applied in many pattern recognition and object detection applications. DNNs generally consist of millions or even billions of parameters. These demanding computational and storage requirements impede deployments of DNNs in resource-limited devices, such as mobile devices and microcontrollers. Simplification techniques such as pruning have commonly been used to slim DNN sizes. Pruning approaches generally quantify the importance of each component, such as a network weight. Weight values or weight gradients in training are commonly used as the importance metric. Small weights are pruned and large weights are kept. However, small weights may be connected to significant weights that affect DNN outputs, so DNN accuracy can be degraded significantly after the pruning process. This paper proposes a roulette wheel-like pruning algorithm to simplify a trained DNN while keeping the DNN accuracy. The proposed algorithm generates a branch of pruned DNNs using a roulette wheel operator. Similar to the roulette wheel selection in genetic algorithms, small weights are more likely to be pruned but they can be kept; large weights are more likely to be kept but they can be pruned. The slimmest DNN with the best accuracy is selected from the branch. The performance of the proposed pruning algorithm is evaluated on two deterministic datasets and four non-deterministic datasets. Experimental results show that the proposed pruning algorithm generates simpler DNNs while maintaining DNN accuracy, compared to several existing pruning approaches.
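The roulette-wheel operator described here can be sketched as a stochastic mask whose pruning probability falls as weight magnitude rises. The abstract does not give the probability formula or the exact selection rule, so the linear mapping between `lo` and `hi`, the `tolerance` parameter, and the caller-supplied `score` function below are all assumptions for illustration.

```python
import random

def roulette_prune(weights, rng, lo=0.05, hi=0.95):
    """Return a pruned copy of weights.

    Pruning probability runs from hi (smallest |w|) down to lo (largest |w|),
    so small weights can survive and large weights can still be pruned.
    """
    magnitudes = [abs(w) for w in weights]
    top = max(magnitudes) or 1.0  # avoid dividing by zero if all weights are 0
    pruned = []
    for w, m in zip(weights, magnitudes):
        p_prune = lo + (hi - lo) * (1.0 - m / top)
        pruned.append(0.0 if rng.random() < p_prune else w)
    return pruned

def count_zeros(weights):
    return sum(1 for w in weights if w == 0.0)

def select_candidate(weights, score, n_candidates=20, tolerance=0.0, seed=0):
    """Generate pruned candidates and pick the slimmest acceptable one.

    `score` is a hypothetical accuracy function (higher is better). Candidates
    scoring below baseline - tolerance are rejected; ties on sparsity are
    broken by score. Falls back to the unpruned weights if none qualify.
    """
    rng = random.Random(seed)
    baseline = score(weights)
    candidates = [roulette_prune(weights, rng) for _ in range(n_candidates)]
    acceptable = [c for c in candidates if score(c) >= baseline - tolerance]
    if not acceptable:
        return list(weights)
    return max(acceptable, key=lambda c: (count_zeros(c), score(c)))
```

In practice `score` would evaluate the pruned network on validation data. Drawing many candidates is what distinguishes this from deterministic magnitude pruning: a small weight that turns out to matter has a chance of surviving in at least one candidate.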
- Conference Article
17
- 10.23919/date51398.2021.9473973
- Feb 1, 2021
In-memory computing (IMC) has been demonstrated as a promising technique to significantly improve energy-efficiency for deep neural network (DNN) hardware accelerators. However, designing one involves setting many design variables such as the number of parallel rows to assert, the analog-to-digital converter (ADC) at the periphery of the memory sub-array, activation/weight precisions of DNNs, etc., which affect energy-efficiency, DNN accuracy, and area. While individual IMC designs have been presented in the literature, they have not investigated this multi-dimensional design optimization. In this paper, to fill this knowledge gap, we present an SRAM-based IMC hardware modeling and optimization framework. A unified systematic study closely models IMC hardware and investigates how a number of design variables and nonidealities (e.g., device mismatch and ADC quantization) affect the DNN accuracy of an IMC design. To maintain high DNN accuracy for the IMC SRAM hardware, it is shown that the number of activated rows, ADC resolution, ADC quantization range, and different sources of variability/noise need to be carefully selected and co-optimized with the underlying DNN algorithm being implemented.
- Conference Article
- 10.1109/urtc45901.2018.9244778
- Oct 5, 2018
This work explores a hypothesis for the observation that the accuracy of Deep Neural Networks (DNNs) increases with the depth of the network. The aim of the project is to count the number of exact solutions to a simplified DNN problem. A finite family of DNN functions is defined so that the number of solutions as a function of depth can be counted. Through construction of these DNN solutions, a lower bound and an approximate rate of growth can be found for the number of solutions. This function indicates that the number of solutions grows rapidly with depth, which may offer some insight into why the accuracy of DNNs increases with the depth of the network.
- Conference Article
124
- 10.1145/3352460.3358280
- Oct 12, 2019
The effectiveness of deep neural networks (DNN) in vision, speech, and language processing has prompted a tremendous demand for energy-efficient high-performance DNN inference systems. Due to the increasing memory intensity of most DNN workloads, main memory can dominate the system's energy consumption and stall time. One effective way to reduce the energy consumption and increase the performance of DNN inference systems is by using approximate memory, which operates with reduced supply voltage and reduced access latency parameters that violate standard specifications. Using approximate memory reduces reliability, leading to higher bit error rates. Fortunately, neural networks have an intrinsic capacity to tolerate increased bit errors. This can enable energy-efficient and high-performance neural network inference using approximate DRAM devices. Based on this observation, we propose EDEN, the first general framework that reduces DNN energy consumption and DNN evaluation latency by using approximate DRAM devices, while strictly meeting a user-specified target DNN accuracy. EDEN relies on two key ideas: 1) retraining the DNN for a target approximate DRAM device to increase the DNN's error tolerance, and 2) efficient mapping of the error tolerance of each individual DNN data type to a corresponding approximate DRAM partition in a way that meets the user-specified DNN accuracy requirements. We evaluate EDEN on multi-core CPUs, GPUs, and DNN accelerators with error models obtained from real approximate DRAM devices. We show that EDEN's DNN retraining technique reliably improves the error resiliency of the DNN by an order of magnitude. 
For a target accuracy within 1% of the original DNN, our results show that EDEN enables 1) an average DRAM energy reduction of 21%, 37%, 31%, and 32% in CPU, GPU, and two different DNN accelerator architectures, respectively, across a variety of state-of-the-art networks, and 2) an average (maximum) speedup of 8% (17%) and 2.7% (5.5%) in CPU and GPU architectures, respectively, when evaluating latency-bound neural networks.
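One ingredient of such a framework, exposing DNN data to the bit errors of approximate DRAM, can be emulated in software by flipping bits in the stored representation of each weight. The sketch below is a simplification and not EDEN's error model: it assumes independent, uniformly distributed bit flips at a fixed bit error rate `ber` on 32-bit floats, whereas the abstract states that EDEN uses error models obtained from real approximate DRAM devices. The function names are hypothetical.

```python
import random
import struct

def flip_bits(value, ber, rng):
    """Flip each of the 32 bits of a float32 independently with probability ber."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    for position in range(32):
        if rng.random() < ber:
            bits ^= 1 << position
    (corrupted,) = struct.unpack("<f", struct.pack("<I", bits))
    return corrupted  # may be NaN or inf if exponent bits were hit

def inject_errors(weights, ber, seed=0):
    """Return a copy of weights as they might read back from faulty memory."""
    rng = random.Random(seed)
    return [flip_bits(w, ber, rng) for w in weights]
```

Evaluating a network on `inject_errors(weights, ber)` for increasing `ber` gives a rough picture of how much error a model tolerates before accuracy drops below a target, which is the quantity a retraining step would then try to improve.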
- Research Article
72
- 10.1016/j.micpro.2022.104441
- Jan 12, 2022
- Microprocessors and Microsystems
Review of ASIC accelerators for deep neural network
- Conference Article
61
- 10.1109/rtss.2018.00017
- Dec 1, 2018
Modern embedded cyber-physical systems are becoming entangled with the realm of deep neural networks (DNNs) towards increased autonomy. While applying DNNs can significantly improve the accuracy in making autonomous control decisions, a significant challenge is that DNNs are designed and developed on advanced hardware (e.g., GPU clusters), and will not easily meet strict timing requirements if deployed in a resource-constrained embedded computing environment. One interesting characteristic of DNNs is approximation, which can be used to satisfy real-time requirements by reducing DNNs' execution costs with reasonably sacrificed accuracy. In this paper, we propose ApNet, a timing-predictable runtime system that is able to guarantee deadlines of DNN workloads via efficient approximation. Rather than straightforwardly approximating DNNs, ApNet develops a DNN layer-aware approximation approach that smartly explores the trade-off between the approximation degree and the resulting execution reduction on a per-layer basis. To further reduce approximation-induced accuracy loss at runtime, ApNet explores a rather interesting observation that resource sharing and approximation can mutually supplement one another, particularly in a multi-tasking environment. We have implemented and extensively evaluated ApNet on a mix of 8 different DNN configurations on an NVIDIA Jetson TX2. Experimental results show that ApNet can guarantee timing predictability (i.e., meeting all deadlines), while incurring a reasonable accuracy loss. Moreover, accuracy can be improved by up to 8% via a resource sharing increase of 3.5x on average for overlapping DNN layers.
- Book Chapter
6
- 10.1007/978-981-32-9563-6_12
- Jan 1, 2019
Claim prediction is an important process in the automobile insurance industry for preparing the right type of insurance policy for each potential policyholder. The volume of data available to construct a claim prediction model is usually large. Nowadays, deep neural networks (DNN) have become more popular in the machine learning field, especially for unstructured data such as images, text, or signals. The DNN model integrates feature selection into the model in the form of additional hidden layers. Moreover, DNN is suitable for large volumes of data because of its incremental learning. In this paper, we apply DNN to the problem of claim prediction, which has structured data, and analyze its accuracy. First, we show the sensitivity of the accuracy of DNN to the hyperparameters and compare the performance of DNN with standard neural networks. Our simulation shows that the accuracy of DNN is slightly better than that of standard neural networks in terms of normalized Gini.
- Research Article
30
- 10.1109/access.2020.3022327
- Jan 1, 2020
- IEEE Access
Designing resource-efficient deep neural networks (DNNs) is a challenging task due to the enormous diversity of applications as well as their time-consuming design, training, optimization, and evaluation cycles, especially for resource-constrained embedded systems. To address these challenges, we propose a novel DNN design framework called accuracy-and-performance-aware neural architecture search (APNAS), which can generate DNNs efficiently, as it does not require hardware devices or simulators while searching for optimized DNN model configurations that offer both inference accuracy and high execution performance. In addition, to accelerate the process of DNN generation, APNAS is built on a weight sharing and reinforcement learning-based exploration methodology, which has a recurrent neural network controller at its core to generate sample DNN configurations. The reward in reinforcement learning is formulated as a configurable function that considers the sample DNNs' accuracy and the cycle count required to run on a target hardware architecture. To further expedite the DNN generation process, we devise analytical models for cycle count estimation instead of running millions of DNN configurations on real hardware. We demonstrate that these analytical models are highly accurate and provide cycle count estimates identical to those of a cycle-accurate hardware simulator. Experiments that involve quantitatively varying hardware constraints demonstrate that APNAS requires only 0.55 graphics processing unit (GPU) days on a single Nvidia GTX 1080Ti GPU to generate DNNs that offer an average of 53% fewer cycles with negligible accuracy degradation (on average 3%) for image classification compared to state-of-the-art techniques.
- Conference Article
3
- 10.1109/percom53586.2022.9762400
- Mar 21, 2022
We propose a new concept called Weight Separation of deep neural networks (DNNs), which enables memory-efficient and accurate deep multitask learning on a memory-constrained embedded system. The goal of weight separation is to achieve extreme packing of multiple heterogeneous DNNs into the limited memory of the system while ensuring the prediction accuracy of the constituent DNNs at the same time. The proposed approach separates the DNN weights into two types of weight-pages consisting of a subset of weight parameters, i.e., shared and exclusive weight-pages. It optimally distributes the weight-pages into two levels of the system memory hierarchy and stores them separately, i.e., the shared weight-pages in primary (level-1) memory (e.g., RAM) and the exclusive weight-pages in secondary (level-2) memory (e.g., flash disk or SSD). First, to reduce the memory usage of multiple DNNs, less critical weight parameters are identified and overlapped onto the shared weight-pages that are deployed in the limited space of the primary (main) memory. Next, to retain the prediction accuracy of multiple DNNs, the essential weight parameters that play a critical role in preserving prediction accuracy are stored intact in the plentiful space of secondary memory storage in the form of exclusive weight-pages without overlapping. We implement two real systems applying the proposed weight separation: 1) a microcontroller-based multitask IoT system that performs multitask learning of 10 scaled-down DNNs by separating the weight parameters into FRAM and flash disk, and 2) an embedded GPU system that performs multitask learning of 10 state-of-the-art DNNs, separating the weight parameters into GPU RAM and eMMC. Our evaluation shows that memory efficiency, prediction accuracy, and execution time of deep multitask learning improve up to 5.9x, 2.0%, and 13.1x, respectively, without any modification of DNN models.
- Conference Article
4
- 10.1109/dsc47296.2019.8937635
- Nov 1, 2019
Training Deep Neural Network (DNN) models often requires significant computational resources due to the large dataset sizes and the huge number of parameters to be optimized. A cloud-based approach may be utilized to accommodate such resource needs with flexibility and efficiency. However, protecting data privacy is a challenge in such approaches. Most encryption-based approaches for providing privacy incur substantial overheads. In many instances, though, only part of the data needs to be protected, and the level of privacy is often dependent on the application requirements. Various types of image filtering techniques are utilized to generate distorted DNN datasets that satisfy application-specific privacy requirements. In general, a high distortion level provides strong protection of data privacy but degrades the DNN accuracy. To find the appropriate type and level of image filtering prior to the training process, we identify an image similarity metric that can be used as a DNN accuracy predictor as well as a distortion level indicator. Furthermore, to improve the DNN accuracy on highly distorted datasets, we propose a privacy-preserving federated-cloud DNN training/classification on multiple distorted datasets. Each cloud trains an independent DNN model with a different image filtering algorithm, and then the client combines and utilizes the multiple models to obtain a well-performing model. Experiments were conducted to validate the effectiveness of the proposed schemes.