Abstract

As embedded systems with limited resources, such as smartphones, have become increasingly popular, on-device deep learning in such systems has recently become an active research area. In this study, we therefore propose a deep learning framework specialized for embedded systems with limited resources, whose operation processing structure differs from that of standard PCs. The proposed framework provides an OpenCL-based accelerator engine for accelerating deep learning operations on various embedded systems. Moreover, the parallel processing performance of OpenCL is maximized through OpenCL kernels optimized for embedded GPUs and for the structural characteristics of embedded systems, such as unified memory. Furthermore, an on-device optimizer for improving inference performance in on-device environments, and model converters for compatibility with conventional frameworks, are provided. The results of a performance evaluation show that the proposed on-device framework outperforms conventional methods.

Highlights

  • Deep neural networks (DNNs) have been widely adopted in various fields, such as in image and character recognition and object detection [1,2,3,4,5,6,7,8,9,10]

  • The ACL is only operable on ARM central processing units (CPUs) and graphics processing units (GPUs); Caffe or TensorFlow models can be used when ArmNN [34] is used, but ACL alone cannot be linked with conventional deep learning frameworks

  • The accelerator engine consists of an OpenCL-based BLAS (CSblas), which is optimized for embedded GPUs, and a DNN acceleration library
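The BLAS routines highlighted above are the workhorse of the accelerator engine: convolution and fully connected layers in DNN libraries are commonly lowered to a general matrix multiply (GEMM). The sketch below gives a plain-C reference of the SGEMM operation that a library such as CSblas would accelerate on the GPU; the function name is hypothetical and is not CSblas's actual API.

```c
#include <stddef.h>

/* Reference SGEMM: C = alpha * A * B + beta * C, row-major storage.
   A is M x K, B is K x N, C is M x N. An OpenCL BLAS parallelizes the
   (i, j) loops across work-items; this scalar version only illustrates
   the computation being accelerated. */
void sgemm_ref(size_t M, size_t N, size_t K,
               float alpha, const float *A, const float *B,
               float beta, float *C)
{
    for (size_t i = 0; i < M; ++i) {
        for (size_t j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
    }
}
```

A convolution layer is typically mapped onto this routine via an im2col transform, so optimizing this one kernel for the embedded GPU speeds up most of the network.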


Summary

Introduction

Deep neural networks (DNNs) have been widely adopted in various fields, such as image and character recognition and object detection [1,2,3,4,5,6,7,8,9,10]. In this study, we propose CitiusSynapse, a deep learning framework specialized for embedded systems. The proposed framework performs deep learning operations based on OpenCL [21] to accelerate them within various embedded systems. By exploiting structural characteristics of embedded systems, such as unified memory shared between the CPU and GPU, our framework also provides an on-device optimizer that improves inference performance on embedded devices. The deep learning core executes deep learning operations in conjunction with the accelerator engine. Our framework was compared with conventional deep learning frameworks through a performance evaluation on an embedded board, which demonstrates the superiority of the proposed framework.
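Launching OpenCL kernels efficiently on an embedded GPU requires choosing work-group (local) sizes and padding the global NDRange to a multiple of the local size, which is what an NDRange tuner (see the NDRange Optimizer section) searches over. The helper below sketches the padding step; the function name is an assumption for illustration, not the framework's API.

```c
#include <stddef.h>

/* Round a global work size up to a multiple of the local (work-group)
   size, as OpenCL requires when an explicit local size is passed to
   clEnqueueNDRangeKernel. The kernel then guards out-of-range
   work-items with: if (get_global_id(0) >= n) return; */
size_t round_up_global(size_t global, size_t local)
{
    return ((global + local - 1) / local) * local;
}
```

An NDRange tuner would time candidate local sizes (e.g. 32, 64, 128 work-items) with the global size padded this way and keep the fastest configuration for the target GPU.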

OpenCL
Accelerated Libraries for Deep Learning
Deep Learning Frameworks
Deep Learning Core and Accelerator Engine
Data Structure with Unified Memory
Comparison
Structure
Accelerator Engine
CSblas
Asynchronous queue execution
On-Device Optimizer for Inference
NDRange Optimizer
Quantization Optimizer
Model Converter for Compatibility
Experimental Setup
Comparison of Inference Times
Comparison of Memory Usage for Inference
Findings
Conclusions and Discussion
