Using HW/SW Codesign for Deep Neural Network Hardware Accelerator Targeting Low-Resources Embedded Processors

Erez Manor, Shlomo Greenberg

doi:10.1109/access.2022.3153119

Abstract

RISC-based embedded processors, aimed at low cost and low power, are becoming an increasingly popular ecosystem for both hardware and software development. High-performance yet low-power embedded processors may be attained via hardware acceleration and Instruction Set Architecture (ISA) extension. Efficiently mapping the computational load onto hardware and software resources is a key challenge for improving performance while keeping power and area low. Furthermore, exploring performance at an early stage of the design makes this challenge more difficult. Potential hardware accelerators can be identified and extracted from the high-level source code by graph analysis that enumerates common patterns. A scheduling algorithm is used to select an optimized subset of accelerators that meets real-time constraints. This paper proposes an efficient hardware/software codesign partitioning methodology applied to a high-level programming language at an early stage of the design. The proposed methodology is based on graph analysis. The applied algorithms are represented by a synchronous directed acyclic graph. A constraint-driven method and a unique scheduling algorithm are used for graph partitioning to obtain the overall speedup and area requirements. The proposed hardware/software partitioning methodology has been evaluated on the MLPerf Tiny benchmark. Experimental results demonstrate a speedup of up to 3 orders of magnitude compared to a software-only implementation. For example, the runtime of the Keyword Spotting (KWS) model is reduced from 206 s for the software implementation to only 181 ms using the proposed hardware-acceleration approach.
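As a quick check of the KWS figures quoted in the abstract, the following minimal sketch computes the speedup implied by the two runtimes (206 s and 181 ms). The function name `speedup` is a hypothetical helper introduced here for illustration; only the two runtimes come from the abstract.

```python
import math


def speedup(sw_seconds: float, hw_seconds: float) -> float:
    """Ratio of software-only runtime to accelerated runtime (hypothetical helper)."""
    return sw_seconds / hw_seconds


# Runtimes quoted in the abstract for the KWS model.
kws_sw_seconds = 206.0   # software-only implementation
kws_hw_seconds = 0.181   # with hardware acceleration (181 ms)

kws_speedup = speedup(kws_sw_seconds, kws_hw_seconds)
kws_orders = math.log10(kws_speedup)

print(f"KWS speedup: {kws_speedup:.0f}x (about {kws_orders:.2f} orders of magnitude)")
```

The ratio is a little over 1100, which is consistent with the "up to 3 orders of magnitude" claim.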

Highlights

In recent years, the complexity of embedded platforms, such as Internet-of-Things (IoT) devices, has been increasing steadily, with conflicting requirements for high performance and real-time capabilities versus minimal power and size.
The MLPerf Tiny benchmark [41], [42] is used for runtime and code-size comparison. This benchmark consists of three sequential models for machine learning tasks: (a) Keyword Spotting (KWS), which uses a neural network that detects keywords from an audio spectrogram; (b) Visual Wake Words (VWW), a binary image classification task for determining the presence of a person in an image; and (c) Anomaly Detection (AD), which uses a neural network to identify abnormalities in machine operating sounds.
To further evaluate the proposed methodology, we examined common TensorFlow Lite for Microcontrollers (TFLM) models: (a) Google's 'Gesture Recognition Magic Wand' (GRMW) network, which was trained to detect wand gestures [43], and (b) an MNIST network used for Handwritten Digit Recognition (HDR) [44].

Summary

INTRODUCTION

In recent years, the complexity of embedded platforms, such as Internet-of-Things (IoT) devices, has been increasing steadily, with conflicting requirements for high performance and real-time capabilities versus minimal power and size. A common approach to accelerating an application on an extensible processor usually proceeds in the following stages [6]: (1) develop the algorithm in a high-level programming language (e.g., Matlab, Python); (2) translate the application source code into a lower-level programming language (e.g., C); (3) compile the code for the appropriate target hardware, and evaluate performance and energy efficiency. We propose an efficient methodology for hardware/software partitioning applied to a high-level programming language at an early stage of the design. We suggest a unique framework, based on the proposed methodology, that analyzes a given high-level source code, extracts a set of hardware accelerators, and implements them in a custom micro-architecture model.
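To make the idea of constraint-driven accelerator selection concrete, here is a minimal sketch of a greedy hardware/software partitioner. It is not the authors' scheduling algorithm, which this summary does not describe. It assumes strictly sequential execution (so the graph structure is ignored and total runtime is the sum of node runtimes), and every name and number in it (`Node`, `partition`, the kernel runtimes and areas) is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One candidate kernel extracted from the application (hypothetical model)."""
    name: str
    sw_ms: float   # runtime when executed in software
    hw_ms: float   # runtime when offloaded to an accelerator
    area: float    # accelerator area cost, arbitrary units, must be > 0


def partition(nodes, deadline_ms, area_budget):
    """Greedily offload the kernels with the best time saved per unit area.

    Stops as soon as the deadline is met; skips kernels that would
    exceed the area budget. Returns (offloaded names, runtime, area used).
    """
    chosen = set()
    runtime = sum(n.sw_ms for n in nodes)  # software-only baseline
    used = 0.0
    candidates = sorted(
        (n for n in nodes if n.hw_ms < n.sw_ms),
        key=lambda n: (n.sw_ms - n.hw_ms) / n.area,
        reverse=True,
    )
    for n in candidates:
        if runtime <= deadline_ms:
            break  # real-time constraint already satisfied
        if used + n.area > area_budget:
            continue  # accelerator does not fit in the remaining area
        chosen.add(n.name)
        used += n.area
        runtime -= n.sw_ms - n.hw_ms
    return chosen, runtime, used


# Made-up kernels for illustration only; not measurements from the paper.
kernels = [
    Node("conv", sw_ms=900.0, hw_ms=5.0, area=4.0),
    Node("fc", sw_ms=80.0, hw_ms=2.0, area=2.0),
    Node("softmax", sw_ms=20.0, hw_ms=15.0, area=3.0),
]
chosen, runtime_ms, area_used = partition(kernels, deadline_ms=50.0, area_budget=6.0)
print(sorted(chosen), runtime_ms, area_used)
```

A greedy ratio heuristic is used only because it is short and easy to follow; it does not guarantee an optimal subset, and a real partitioner would also account for data dependencies and communication overhead between the processor and the accelerators.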

BACKGROUND

THE PROPOSED APPROACH

PROBLEM FORMULATION

GRAPH SCHEDULING

EXPERIMENTAL AND RESULTS

CONCLUSIONS


Journal: IEEE Access	Publication Date: Jan 1, 2022
Citations: 5	License type: CC BY 4.0
