A Generalized Software Framework for Deep Neural Network Inference in Space Applications
- Book Chapter
3
- 10.1007/978-981-19-7615-5_5
- Jan 1, 2023
Deep learning is an effective machine learning approach capable of learning deep representations of data with high accuracy. The outstanding performance of deep learning models comes with a series of network layers that demand high computational energy and add latency overhead to the system. Inference of a deep neural network (DNN) completes and delivers output only after processing all the network layers, irrespective of the input pattern. This complexity prohibits the use of DNNs in energy-constrained, low-latency real-time applications. A possible solution is multi-exit neural networks, which introduce multiple exit branches into standard neural networks. These early exit neural networks deliver output from their intermediate layers through exit points based on specific confidence criteria. The majority of input samples can be processed at the initial layers of the network, while more complex input samples are forwarded to the remaining layers for further processing. This paper analyzes the performance of early exit deep neural networks against their confidence criteria and the number of branches. The study also evaluates the classification accuracy among exit branches. For the analysis, the study implements an object detection application using an early exit MobileNetV2 neural network and the Caltech-256 dataset. The experiments show that early exit DNNs can speed up the inference process with acceptable accuracy, and that the selection of confidence criteria has a significant impact on system performance.
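The confidence-criterion mechanism described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the stage and exit-head callables are hypothetical placeholders, and a top-softmax-probability threshold is assumed as the confidence criterion.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def early_exit_infer(x, stages, exit_heads, threshold=0.9):
    """Run backbone stages in order; after each stage, an exit head
    produces class logits. Return as soon as the top softmax
    probability clears the confidence threshold; the final exit
    always returns."""
    h = x
    for i, (stage, head) in enumerate(zip(stages, exit_heads)):
        h = stage(h)
        probs = softmax(head(h))
        if probs.max() >= threshold or i == len(stages) - 1:
            return int(np.argmax(probs)), i  # (predicted class, exit taken)
```

Raising the threshold pushes more samples to deeper exits (higher accuracy, higher latency); lowering it does the opposite — the trade-off the paper analyzes.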
- Research Article
69
- 10.1109/tmc.2021.3125949
- May 1, 2023
- IEEE Transactions on Mobile Computing
Mobile Edge Computing (MEC) has emerged as a promising paradigm catering to the explosive growth of mobile applications, by offloading compute-intensive tasks to an MEC network for processing. The surge of deep learning brings new vigor and vitality to shape the prospect of the intelligent Internet of Things (IoT), and edge intelligence arises to provision real-time deep neural network (DNN) inference services for users. In this paper, we study a novel delay-aware DNN inference throughput maximization problem, accelerating each DNN inference by jointly exploring DNN partitioning and multi-thread parallelism. Specifically, we consider the problem under both offline and online request arrival settings: in the offline setting, a set of DNN inference requests is given in advance; in the online setting, a sequence of DNN inference requests arrives one by one without knowledge of future arrivals. We first show that the defined problems are NP-hard. We then devise a novel constant approximation algorithm for the problem under the offline setting. We also propose an online algorithm with a provable competitive ratio for the problem under the online setting. We finally evaluate the performance of the proposed algorithms through experimental simulations. Experimental results demonstrate that the proposed algorithms are promising.
- Research Article
1
- 10.1016/j.neucom.2024.128628
- Sep 19, 2024
- Neurocomputing
ShaderNN: A lightweight and efficient inference engine for real-time applications on mobile GPUs
- Conference Article
7
- 10.1109/lcn52139.2021.9524928
- Oct 4, 2021
Mobile Edge Computing (MEC) has emerged as a promising paradigm catering to the explosive growth of mobile applications, by offloading compute-intensive tasks to an MEC network for processing. The surge of deep learning brings new vigor and vitality to shape the prospect of the intelligent Internet of Things (IoT), and edge intelligence arises to provision real-time deep neural network (DNN) inference services for users. To accelerate the processing of a DNN inference request in an MEC network, the DNN inference model can usually be partitioned into two connected parts: one part is processed on the local IoT device of the request, and the other part is processed on a cloudlet (server) in the MEC network. The DNN inference can be further accelerated by allocating multiple threads of the cloudlet to which the request is assigned. In this paper, we study a novel delay-aware DNN inference throughput maximization problem that aims to maximize the number of delay-aware DNN service requests admitted, by accelerating each DNN inference through jointly exploring DNN model partitioning and multi-thread parallelism of DNN inference. To this end, we first show that the problem is NP-hard. We then devise a constant approximation algorithm for it. We finally evaluate the performance of the proposed algorithm through experimental simulations. Experimental results demonstrate that the proposed algorithm is promising.
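A toy version of the split-point search implied by this device/cloudlet partitioning model can be sketched as below. This is illustrative only — the paper's algorithm additionally handles request admission and approximation guarantees; the function name, the per-layer cost inputs, and the idealized linear speedup from threads are all assumptions.

```python
def best_partition(layer_local_ms, layer_remote_ms, layer_out_kb,
                   input_kb, bandwidth_kb_per_ms, threads=1):
    """Pick a split layer k: layers [0, k) run on the IoT device and
    layers [k, n) on the cloudlet, assuming an idealized linear
    speedup from `threads`. The activation produced at the split
    (or the raw input, if k == 0) is sent over the link."""
    n = len(layer_local_ms)
    best_k, best_lat = 0, float("inf")
    for k in range(n + 1):
        local = sum(layer_local_ms[:k])
        if k == n:  # fully local: nothing to transfer, nothing remote
            transfer, remote = 0.0, 0.0
        else:
            size_kb = layer_out_kb[k - 1] if k > 0 else input_kb
            transfer = size_kb / bandwidth_kb_per_ms
            remote = sum(layer_remote_ms[k:]) / threads
        lat = local + transfer + remote
        if lat < best_lat:
            best_k, best_lat = k, lat
    return best_k, best_lat
```

With a fast link and a strong cloudlet the search tends toward k = 0 (full offload); with a slow link it tends toward later splits, which is the trade-off the partitioning exploits.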
- Book Chapter
1
- 10.1002/9781119593584.ch2
- Apr 16, 2021
This chapter presents the multi-tier computing network architecture for intelligent Internet of Things (IoT) applications, along with two important frameworks: cost-aware task scheduling and fog-as-a-service technology. It describes two intelligent application scenarios and the corresponding technical solutions as illustrative case studies of the multi-tier computing network architecture. The chapter proposes an on-site cooperative deep neural network (DNN) inference framework, based on edge computing, to execute DNN inference tasks with low latency and high accuracy for industrial IoT applications, thus meeting the requirements on service delay and reliability. It also proposes a three-tier collaborative computing and service framework, based on fog computing, to support dynamic task offloading and service composition in simultaneous localization and mapping for a robot swarm system, which requires timely data sharing and joint processing among multiple moving robots. The chapter also introduces Boomerang, an on-demand cooperative inference framework.
- Conference Article
223
- 10.1145/3373376.3378534
- Mar 9, 2020
With the emergence of a spectrum of high-end mobile devices, many applications that formerly required desktop-level computation capability are being transferred to these devices. However, executing the inference of Deep Neural Networks (DNNs) is still challenging considering high computation and storage demands, specifically, if real-time performance with high accuracy is needed. Weight pruning of DNNs is proposed, but existing schemes represent two extremes in the design space: non-structured pruning is fine-grained, accurate, but not hardware friendly; structured pruning is coarse-grained, hardware-efficient, but with higher accuracy loss. In this paper, we introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space. With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency. In other words, our method achieves the best of both worlds, and is desirable across theory/algorithm, compiler, and hardware levels. The proposed PatDNN is an end-to-end framework to efficiently execute DNN on mobile devices with the help of a novel model compression technique (pattern-based pruning based on extended ADMM solution framework) and a set of thorough architecture-aware compiler- and code generation-based optimizations (filter kernel reordering, compressed weight storage, register load redundancy elimination, and parameter auto-tuning). Evaluation results demonstrate that PatDNN outperforms three state-of-the-art end-to-end DNN frameworks, TensorFlow Lite, TVM, and Alibaba Mobile Neural Network with speedup up to 44.5x, 11.4x, and 7.1x, respectively, with no accuracy compromise. Real-time inference of representative large-scale DNNs (e.g., VGG-16, ResNet-50) can be achieved using mobile devices.
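The core idea of PatDNN's new design point — a fixed small pattern of nonzeros inside every coarse kernel — can be illustrated with the sketch below. This is a simplification: PatDNN derives patterns through an extended ADMM framework plus compiler optimizations, whereas this sketch assumes a simple magnitude-based choice among candidate patterns.

```python
import numpy as np

def pattern_prune(weights, patterns):
    """Pattern-based pruning sketch. `weights` has shape
    (out_ch, in_ch, 3, 3). For every 3x3 kernel, keep only the
    candidate pattern (a set of (row, col) positions) that preserves
    the largest total weight magnitude; zero out the rest."""
    pruned = np.zeros_like(weights)
    for o in range(weights.shape[0]):
        for i in range(weights.shape[1]):
            kernel = weights[o, i]
            best = max(patterns,
                       key=lambda p: sum(abs(kernel[r, c]) for r, c in p))
            for r, c in best:
                pruned[o, i, r, c] = kernel[r, c]
    return pruned
```

Because every kernel keeps the same number of nonzeros in one of a few known shapes, a compiler can reorder and pack kernels efficiently — the hardware-friendly regularity that non-structured pruning lacks, without the accuracy loss of fully structured pruning.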
- Conference Article
1
- 10.1109/isvlsi54635.2022.00073
- Jul 1, 2022
The sensor subsystem is a crucial component in a Deep Neural Network (DNN) inference framework. However, the high amount of sensor data being generated manifests as an energy bottleneck in resource-constrained edge devices. Towards this end, we propose SeNNse, a novel sensor compression methodology that optimizes the energy requirement of sensor subsystems, which involves a two-step approach of subsampling and subsequent supersampling of the sensor images via interpolation. However, such compressed sensor inputs may result in substantial performance degradation in the presence of bit-flip faults manifested in DNN accelerators (as shown in this paper). Such faults occur frequently in semiconductor device memory due to variegated reasons, ranging from impingement of high-energy particles to structural deformities. We evaluate our approach on Multilayer Perceptrons (MLP), trained on MNIST, EMNIST, and CIFAR-10 datasets. Our proposed SeNNse framework furnishes maximum energy savings of 62.1%, with a negligible reduction in classification accuracy. However, our results also indicate larger performance degradation, of up to 21.56%, due to bit-flip faults for such compressed inputs, which is mainly attributed to the concise set of input activations being fed to the neural networks during inference.
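The two-step subsample/supersample compression can be sketched as below. Nearest-neighbor interpolation is assumed here for simplicity; the abstract does not fix the interpolation kernel SeNNse uses.

```python
import numpy as np

def compress_sensor_image(img, factor=2):
    """Subsample the sensor image by `factor` (only these pixels need
    to be sensed and stored), then supersample back to the original
    resolution via nearest-neighbor interpolation so the DNN input
    shape is unchanged."""
    sub = img[::factor, ::factor]
    up = np.repeat(np.repeat(sub, factor, axis=0), factor, axis=1)
    return up[: img.shape[0], : img.shape[1]]
```

The energy saving comes from sensing roughly 1/factor² of the pixels; the paper's point is that this concentration of information into fewer distinct activations also makes the input more fragile under bit-flip faults.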
- Conference Article
1
- 10.1109/percomworkshops51409.2021.9431004
- Mar 22, 2021
This artifact is a guideline for how the Edgecaffe framework, presented in [1], can be used. Edgecaffe is an open-source Deep Neural Network framework for efficient multi-network inference on edge devices. The framework enables layer-by-layer execution and fine-grained control during inference of Deep Neural Networks. Edgecaffe was created to give more fine-grained control over execution during inference than is offered by the original code of Caffe [2]. Edgecaffe made it possible for Masa to outperform Deepeye [3] and normal bulk execution. Besides the core implementation of Edgecaffe, the repository holds additional tools, Queue Runner and ModelSplitter, that make it more convenient to run experiments and prepare newly trained networks.
- Research Article
5
- 10.3390/app9081669
- Apr 23, 2019
- Applied Sciences
Deep neural networks (DNNs) have been quite successful in solving many complex learning problems. However, DNNs tend to have a large number of learning parameters, leading to large memory and computation requirements. In this paper, we propose a model compression framework for efficient training and inference of deep neural networks on embedded systems. Our framework provides data structures and kernels for OpenCL-based parallel forward and backward computation in a compressed form. In particular, our method learns sparse representations of parameters using ℓ1-based sparse coding while training, storing them in compressed sparse matrices. Unlike previous works, our method does not require a pre-trained model as an input and can therefore be more versatile for different application environments. Even though the use of ℓ1-based sparse coding for model compression is not new, we show that it can be far more effective than previously reported when we use proximal point algorithms and the technique of debiasing. Our experiments show that our method can produce minimal learning models suitable for small embedded devices.
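The proximal machinery behind ℓ1-based sparse coding reduces to the soft-thresholding operator, which is what produces exact zeros in the stored parameters. A minimal sketch follows; the paper's OpenCL kernels, compressed sparse-matrix storage, and debiasing step are beyond this illustration, and the function names are my own.

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal operator of lam * ||w||_1: shrink each entry toward
    zero by lam and clamp small entries exactly to zero — the source
    of the compressed sparse representation."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def proximal_gradient_step(w, grad, lr, lam):
    """One proximal gradient step on loss(w) + lam * ||w||_1:
    a plain gradient step on the smooth loss, followed by the
    l1 prox with the scaled penalty."""
    return soft_threshold(w - lr * grad, lr * lam)
```

Entries whose magnitude stays below the threshold across iterations remain exactly zero, so the weight matrix can be kept in compressed sparse form during training as well as inference.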
- Research Article
5
- 10.1002/stvr.1873
- Feb 1, 2024
- Software Testing, Verification and Reliability
Safety-critical applications, such as autonomous vehicles, healthcare, and space applications, have witnessed widespread deployment of deep neural networks (DNNs). Inherent algorithmic inaccuracies have consistently been a prevalent cause of misclassifications, even in modern DNNs. Simultaneously, with the ongoing effort to minimize the footprint of contemporary chip designs, there is a continual rise in the likelihood of transient hardware faults in deployed DNN models. Consequently, researchers have wondered about the extent to which these faults contribute to DNN misclassifications compared to algorithmic inaccuracies. This article delves into the impact of DNN misclassifications caused by transient hardware faults and intrinsic algorithmic inaccuracies in safety-critical applications. Initially, we enhance a cutting-edge fault injector, TensorFI, for TensorFlow applications to facilitate fault injections on modern DNN non-sequential models in a scalable manner. Subsequently, we analyse the DNN-inferred outcomes based on our defined safety-critical metrics. Finally, we conduct extensive fault injection experiments and a comprehensive analysis to achieve the following objectives: (1) investigate the impact of different target class groupings on DNN failures and (2) pinpoint the most vulnerable bit locations within tensors, as well as the DNN layers accountable for the majority of safety-critical misclassifications. Our findings regarding different grouping formations reveal that failures induced by transient hardware faults can have a substantially greater impact (with a probability up to 4× higher) on safety-critical applications than those resulting from algorithmic inaccuracies. Additionally, our investigation demonstrates that higher-order bit positions in tensors, as well as the initial and final layers of DNNs, necessitate prioritized protection compared to other regions.
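The transient single-bit fault studied here can be modeled directly on a float32 weight or activation value, as in the sketch below. This is only an illustration of the fault model; TensorFI's actual injection operates on TensorFlow graph tensors, and the helper name is my own.

```python
import math
import struct

def flip_bit(x, bit):
    """Flip one bit of a float32 value and return the faulty value.
    Bit 31 is the sign, bits 30-23 the exponent, bits 22-0 the
    mantissa (IEEE 754 single precision)."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", x))
    (faulty,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return faulty
```

Flipping a high exponent bit of 1.0 yields infinity, while flipping a low mantissa bit barely moves the value — consistent with the article's finding that higher-order bit positions dominate safety-critical misclassifications and warrant prioritized protection.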
- Research Article
37
- 10.1109/tpds.2022.3232715
- Mar 1, 2023
- IEEE Transactions on Parallel and Distributed Systems
GPUs are essential to accelerating the latency-sensitive deep neural network (DNN) inference workloads in cloud datacenters. To fully utilize GPU resources, spatial sharing of GPUs among co-located DNN inference workloads becomes increasingly compelling. However, GPU sharing inevitably brings severe performance interference among co-located inference workloads, as motivated by an empirical measurement study of DNN inference on EC2 GPU instances. While existing works on guaranteeing inference performance service level objectives (SLOs) focus on either temporal sharing of GPUs or reactive GPU resource scaling and inference migration techniques, how to proactively mitigate such severe performance interference has received comparatively little attention. In this paper, we propose iGniter, an interference-aware GPU resource provisioning framework for cost-efficiently achieving predictable DNN inference in the cloud. iGniter comprises two key components: (1) a lightweight DNN inference performance model, which leverages the system and workload metrics that are practically accessible to capture the performance interference; (2) a cost-efficient GPU resource provisioning strategy that jointly optimizes the GPU resource allocation and adaptive batching based on our inference performance model, with the aim of achieving predictable performance of DNN inference workloads. We implement a prototype of iGniter based on the NVIDIA Triton inference server hosted on EC2 GPU instances. Extensive prototype experiments on four representative DNN models and datasets demonstrate that iGniter can guarantee the performance SLOs of DNN inference workloads with practically acceptable runtime overhead, while saving the monetary cost by up to 25% in comparison to the state-of-the-art GPU resource provisioning strategies.
- Conference Article
3
- 10.1145/3543873.3587370
- Apr 30, 2023
Edge computing and cloud computing have been utilized in many AI applications in various fields, such as computer vision, NLP, autonomous driving, and smart cities. To benefit from the advantages of both paradigms, we introduce HiDEC, a hierarchical deep neural network (DNN) inference framework with three novel features. First, HiDEC enables the training of a resource-adaptive DNN through the injection of multiple early exits. Second, HiDEC provides a latency-aware inference scheduler, which determines which input samples should exit locally on an edge device based on the exit scores, enabling inference on edge devices with insufficient resources to run the full model. Third, we introduce a dual thresholding approach allowing both easy and difficult samples to exit early. Our experiments on image and text classification benchmarks show that HiDEC significantly outperforms existing solutions.
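The dual-thresholding rule described for HiDEC can be sketched as below. The threshold names and values are assumptions, as is the interpretation that low-scoring "difficult" samples exit because deeper layers are unlikely to help them; HiDEC's actual exit scores come from its trained exit branches.

```python
def dual_threshold_decision(exit_score, hi=0.9, lo=0.2):
    """Dual-thresholding sketch: exit early both for easy samples
    (score above `hi`, the usual confident case) and for hard samples
    (score below `lo`, where further layers are assumed unlikely to
    help); only in-between samples continue to deeper layers or the
    full model."""
    if exit_score >= hi:
        return "exit-easy"
    if exit_score <= lo:
        return "exit-hard"
    return "continue"
```

Compared with a single upper threshold, this lets a resource-limited edge device terminate both the clearly easy and the hopelessly hard samples locally, forwarding only the ambiguous middle band.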
- Research Article
110
- 10.1109/tvt.2021.3068255
- Mar 24, 2021
- IEEE Transactions on Vehicular Technology
Performing deep neural network (DNN) inference in real time requires excessive network resources, which poses a big challenge to resource-limited industrial Internet of Things (IIoT) networks. To address the challenge, in this paper we introduce an end-edge-cloud orchestration architecture, in which inference task assignment and DNN model placement are flexibly coordinated. Specifically, the DNN models, trained and pre-stored in the cloud, are properly placed at the end and edge to perform DNN inference. To achieve efficient DNN inference, a multi-dimensional resource management problem is formulated to maximize the average inference accuracy while satisfying the strict delay requirements of inference tasks. Due to the mixed-integer decision variables, it is difficult to solve the formulated problem directly. Thus, we transform the formulated problem into a Markov decision process which can be solved efficiently. Furthermore, a deep reinforcement learning based resource management scheme is proposed to make real-time optimal resource allocation decisions. Simulation results demonstrate that the proposed scheme can efficiently allocate the available spectrum, caching, and computing resources, and improve average inference accuracy by 31.4% compared with the deep deterministic policy gradient benchmark.
- Conference Article
3
- 10.1109/iedm45625.2022.10019564
- Dec 3, 2022
Hardware accelerators that exploit analog in-memory computing offer an energy-efficient edge deployment solution for machine learning algorithms. We give an overview of the device requirements and hardware-software co-design principles for these systems to achieve efficient and accurate deep neural network (DNN) inference. We designed and fabricated a 40nm test chip with a 1024 × 1024 SONOS (silicon-oxide-nitride-oxide-silicon) charge trapping memory array for DNN inference. Operating the SONOS memory in the subthreshold regime suppresses the effects of device variability on algorithm accuracy. We experimentally demonstrate accurate DNN inference using the test chip on CIFAR-100 image classification and project a chip-level efficiency of >50 TOPS/W for the SONOS inference accelerator, a 10× advantage over state-of-the-art digital inference accelerators.
- Conference Article
21
- 10.1145/3460120.3484797
- Nov 12, 2021
We introduce COINN - an efficient, accurate, and scalable framework for oblivious deep neural network (DNN) inference in the two-party setting. In our system, DNN inference is performed without revealing the client's private inputs to the server or revealing the server's proprietary DNN weights to the client. To speed up the secure inference while maintaining high accuracy, we make three interlinked innovations in the plaintext and ciphertext domains: (i) we develop a new domain-specific low-bit quantization scheme tailored for high-efficiency ciphertext computation, (ii) we construct novel techniques for increasing data re-use in secure matrix multiplication, allowing us to gain significant performance boosts through factored operations, and (iii) we propose customized cryptographic protocols that complement our optimized DNNs in the ciphertext domain. By co-optimizing the aforesaid components, COINN brings an unprecedented level of efficiency to the setting of oblivious DNN inference, achieving an end-to-end runtime speedup of 4.7×–14.4× over the state-of-the-art. We demonstrate the scalability of our proposed methods by optimizing complex DNNs with over 100 layers and performing oblivious inference in the billion-operation regime for the challenging ImageNet dataset. Our framework is available at https://github.com/ACESLabUCSD/COINN.git.