Abstract

In the field of algorithm acceleration and the implementation of the recall (inference) phase of deep neural networks, OpenCL-based solutions have a clear tendency to produce well-adapted kernels for graphics processing unit (GPU) architectures. However, they fail to obtain the same results when applied to field-programmable gate array (FPGA) based architectures. This situation, together with the rapid advance of new GPU architectures, makes it difficult to defend an FPGA-based acceleration solution, even in terms of energy efficiency. Our goal in this paper is to demonstrate that multikernel structures based on classic systolic arrays can be written in OpenCL, extracting the most advanced features of FPGAs without resorting to traditional FPGA development in lower-level hardware description languages (HDLs) such as Verilog or VHDL. This OpenCL methodology is based on the intensive use of channels (the Intel FPGA extension of OpenCL) to communicate both data and control, on the refinement of the OpenCL libraries with register-transfer level (RTL) code to improve the implementation of the base and activation functions of the neurons, and, above all, on adequate communication between layers when implementing neural networks.

Highlights

  • The field of machine learning, and of deep neural networks in particular, is currently rife with acceleration environments

  • The Arria 10 family was introduced as the field-programmable gate array (FPGA) world's technological answer for handling floating-point operations and competing with graphics processing units (GPUs)

  • We demonstrate that the digital signal processors (DSPs) of this technological family can be handled more efficiently than through the automatic inference performed by the base OpenCL compiler



Introduction

The field of machine learning, and of deep neural networks in particular, is currently rife with acceleration environments. These try to bring electronic technologies of a certain complexity closer to engineers and researchers who use powerful and versatile artificial intelligence (AI) frameworks such as TensorFlow, Caffe, and PyTorch. An evaluation of the different available electronic technologies is required to perform these tasks efficiently. This is where implementation options such as application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs) can provide competitive solutions compared with today's predominant central processing units (CPUs) and graphics processing units (GPUs). A second wave of applications and languages in search of fast implementation and optimal results has appeared, focused on the inference phase of artificial neural networks. OpenVINO and the Xilinx Machine Learning Suite provide direct implementations for FPGA- and GPU-based solutions, while oneAPI aims at providing a new language for machine learning hardware design.

