Abstract

Modern Systems-on-Chip (SoCs) based on Field-Programmable Gate Arrays (FPGAs) give users significant flexibility in choosing how to implement Convolutional Neural Networks (CNNs): a) on a fixed, hardwired general-purpose processor, or b) using the programmable logic to implement application-specific processing cores. This thesis proposes an automated toolflow that maps pre-trained TensorFlow/Keras models onto different target platforms: an ARM core using the Neon extension and a soft-core GPU on the FPGA. CNNs are heterogeneous: convolutional layers, for example, have different memory-access and computation requirements than Fully Connected (FC) layers, suggesting that different hardware may be optimal for different layer types. After evaluating the performance of several CNNs executed on an ARM Cortex-A9 and on the soft-core GPU, it was found that convolutional layers ran 5.9x faster on the soft-core GPU than on the ARM core, whereas FC layers executed faster on the ARM core. This work therefore proposes a collaborative execution of CNNs across the two platforms, running the convolutional and max-pooling layers on the soft-core GPU and the FC layers on the ARM core, achieving a 2x speedup over using the ARM core alone. Building on this result, the thesis also explores other mixes of hardware platforms, as well as partial reconfiguration techniques.
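The partitioning strategy described above can be sketched as a simple layer-to-device mapping. This is a hypothetical illustration, not the thesis's toolflow: the `partition` helper and the layer-type names (Keras-style strings) are assumptions chosen to show the idea of routing convolutional and max-pooling layers to the soft-core GPU and FC layers to the ARM core.

```python
# Illustrative sketch (not the thesis's actual API): assign each CNN layer
# to a device based on its type, following the conv/maxpool -> soft-core GPU,
# FC -> ARM core split reported in the abstract.

CONV_LIKE = {"Conv2D", "MaxPooling2D"}  # layer types that ran faster on the soft-core GPU

def partition(layer_types):
    """Return a list of (layer_type, device) pairs forming an execution plan."""
    plan = []
    for layer_type in layer_types:
        device = "soft-core GPU" if layer_type in CONV_LIKE else "ARM core"
        plan.append((layer_type, device))
    return plan

# Example: a small LeNet-style layer sequence.
model = ["Conv2D", "MaxPooling2D", "Conv2D", "MaxPooling2D", "Flatten", "Dense", "Dense"]
for layer, device in partition(model):
    print(f"{layer:>14} -> {device}")
```

In practice the split point matters for data movement: once execution crosses from the programmable logic back to the processor, intermediate feature maps must be transferred, so grouping all GPU-side layers before all ARM-side layers keeps that cost to a single handoff.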
