Abstract

Edge devices are becoming smarter with the integration of machine learning methods, such as deep learning, and are therefore used in many application domains where decisions have to be made without human intervention. Deep learning and, in particular, convolutional neural networks (CNNs) are more effective than previous algorithms for several computer vision applications, such as security and surveillance, where image and video analysis are required. This better accuracy comes at the cost of high computation and memory requirements. Hence, running CNNs on embedded computing devices is a challenge for both algorithm and hardware designers. New processing devices, dedicated system architectures and network optimizations have been researched to deal with these computation requirements. In this paper, we improve the inference execution times of CNNs in low-density FPGAs (Field-Programmable Gate Arrays) using fixed-point arithmetic, zero-skipping and weight pruning. The developed architecture supports the execution of large CNNs in FPGA devices with reduced on-chip memory and computing resources. With the proposed architecture, it is possible to infer an image with AlexNet in 2.9 ms on a ZYNQ7020 and 1.0 ms on a ZYNQ7045 with less than 1% accuracy degradation. These results improve on previous state-of-the-art architectures for CNN inference.

Highlights

  • Artificial intelligence is widely used in computer vision applications, improving tasks such as image classification [1], object detection, and image segmentation

  • This paper proposes a highly efficient architecture that combines zero-skipping, dynamic pruning, block pruning, fixed-point representations and image batching, to be implemented in low-density FPGAs for smart embedded systems

  • The architectural optimizations proposed in this paper are applied to a baseline architecture that implements large convolutional neural networks (CNNs) in low-density FPGAs using only an 8-bit fixed-point representation format, following the ideas of [7]
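The 8-bit fixed-point format used by the baseline architecture can be illustrated with a minimal quantization sketch. The paper does not specify its exact scaling and rounding scheme here, so the choice of a signed Q-format with a configurable number of fractional bits and saturation to the int8 range is an assumption for illustration:

```python
def to_fixed8(x: float, frac_bits: int) -> int:
    """Quantize a float to signed 8-bit fixed-point with `frac_bits`
    fractional bits, saturating to the int8 range [-128, 127].
    Illustrative only; the paper's exact scheme may differ."""
    q = round(x * (1 << frac_bits))   # scale and round to nearest integer
    return max(-128, min(127, q))     # saturate to 8 bits

def from_fixed8(q: int, frac_bits: int) -> float:
    """Recover the real value represented by the fixed-point code."""
    return q / (1 << frac_bits)
```

With 6 fractional bits, for example, 0.75 is represented exactly as the integer code 48, while values above the representable range saturate at 127.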


Summary

Introduction

Artificial intelligence is widely used in computer vision applications, improving tasks such as image classification [1], object detection, and image segmentation. Several other CNNs have been proposed in recent years, some regular and some irregular, with layers different from the usual convolutional and fully connected layers. Running any of these networks in an embedded system with strict performance, memory and energy constraints is a challenge because of the high number of weights and operations. This paper improves the baseline architecture with the following techniques: zero-skipping in the convolutional layers, where multiplications with zero-valued activations are skipped; dynamic zeroing of activations in convolutional layers; and coarse pruning of fully connected layers, where blocks of redundant weights are cut, reducing both the memory required to store them and the number of operations.
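The two main ideas above, zero-skipping in convolutional layers and coarse (block) pruning of fully connected weights, can be sketched in software. This is a minimal functional model, not the hardware implementation; the block size and the mean-magnitude pruning criterion are assumptions for illustration:

```python
import numpy as np

def zero_skip_dot(activations, weights):
    """Dot product that skips multiplications with zero-valued
    activations, modeling the zero-skipping scheme: a hardware PE
    would simply not issue the multiply for a zero activation."""
    acc = 0
    for a, w in zip(activations, weights):
        if a != 0:              # zero-skipping: zero activation, no multiply
            acc += a * w
    return acc

def block_prune(W, block, threshold):
    """Coarse pruning of a fully connected weight matrix: zero out
    whole block x block tiles whose mean absolute weight falls below
    `threshold`. Pruned tiles need not be stored or computed.
    The tile-mean criterion is illustrative; the paper's exact
    redundancy test may differ."""
    W = W.copy()
    rows, cols = W.shape
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = W[i:i + block, j:j + block]
            if np.abs(tile).mean() < threshold:
                W[i:i + block, j:j + block] = 0
    return W
```

Pruning whole blocks rather than individual weights keeps the memory layout regular, which is what makes the scheme attractive for an FPGA datapath with fixed-width memory ports.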

Related Work
Convolutional Neural Networks
Baseline Architecture for CNN Inference
PE Clusters
Feature Map Memory
Result
Zero-Skipping and Dynamic Pruning of Activations
Pruning of Weights in Fully Connected Layers
Designing with the Proposed Architecture for Best Performance
Performance Model
Area Model
Model Based Design
Results
Conclusions