Implementing a large application (LSTM) on the multi-FPGA system: Flow-in-Cloud

Yugo Yamauchi, Kazusa Musha, Hideharu Amano

doi:10.1109/coolchips.2019.8721333

Abstract

To cope with the computation cost and energy required by recent deep learning technology, domain-specific systems have been deployed in cloud computing so that many application developers can share them efficiently. Although GPUs (Graphics Processing Units) and more specialized systems such as TPUs (Tensor Processing Units) are widely used, FPGAs have been receiving attention, especially for their power efficiency and flexibility. Since energy efficiency is one of the most important issues in recent cloud computing, much research on using FPGAs in the cloud has been reported [3], and commercial systems, including Amazon's F1 instance, are available. However, the achievable performance improvement is bounded by the resources of a single FPGA, even for an FPGA in the cloud. Thus, to implement a large deep learning application, we must either use an expensive high-end FPGA or adopt a lightweight algorithm that sacrifices throughput and accuracy. To deal with this problem, the NEDO-supported project “Power-saving AI engine and platform with heterogeneous engine integrated cloud” has started to develop a large-scale AI system called Flow-in-Cloud (FiC) [4]. FiC consists of a number of economical mid-scale FPGAs interconnected by a high-bandwidth network. From the viewpoint of an HLS (High Level Synthesis) programmer, many FPGAs can be handled as if they were a single large FPGA, so the programmer can implement a large-scale deep learning model without being concerned about the resources of a single FPGA. FiC is managed by the Flow-OS and is shared efficiently by many users. Although FiC is designed to build a heterogeneous computing system, the current prototype consists of multiple FPGA boards called “FiC-SW”, each of which provides both switching and computing capabilities. Here, as a case study of such energy-efficient multi-FPGA computing, we implemented the inference part of Long Short-Term Memory (LSTM) [1].
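As an illustration of what the inference part of an LSTM computes, and of why it can be spread over several boards, the following is a minimal Python sketch of one time step of the commonly used LSTM cell (with input, forget, candidate, and output gates). It is not the FiC implementation: the abstract does not describe how the model is mapped onto FiC-SW boards, so the round-robin assignment of hidden units to boards, and all names (`lstm_step_slice`, `lstm_step_partitioned`, `num_boards`), are assumptions made only for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step_slice(W, b, x, h_prev, c_prev, units):
    """Compute the new hidden and cell state for the hidden units in `units`.

    W[g][j] is the weight row of gate g (one of "i", "f", "g", "o") for
    hidden unit j, applied to the concatenation of x and h_prev.
    """
    z = list(x) + list(h_prev)
    h_new, c_new = {}, {}
    for j in units:
        pre = {}
        for gate in "ifgo":
            pre[gate] = sum(w * v for w, v in zip(W[gate][j], z)) + b[gate][j]
        i = sigmoid(pre["i"])        # input gate
        f = sigmoid(pre["f"])        # forget gate
        cand = math.tanh(pre["g"])   # candidate cell value
        o = sigmoid(pre["o"])        # output gate
        c_new[j] = f * c_prev[j] + i * cand
        h_new[j] = o * math.tanh(c_new[j])
    return h_new, c_new


def lstm_step_partitioned(W, b, x, h_prev, c_prev, num_boards):
    """One LSTM step with hidden units split across hypothetical boards.

    Each "board" needs the full x and h_prev but only its own weight rows,
    so the boards' slices are independent within a single time step.
    """
    hidden = len(h_prev)
    h = [0.0] * hidden
    c = [0.0] * hidden
    for board in range(num_boards):
        # Assumed round-robin assignment of hidden units to boards.
        units = range(board, hidden, num_boards)
        h_part, c_part = lstm_step_slice(W, b, x, h_prev, c_prev, units)
        for j in units:
            h[j] = h_part[j]
            c[j] = c_part[j]
    return h, c
```

In this sketch each board's slice depends only on the previous hidden state, so the one exchange required per time step is the gathering of the new hidden vector. A high-bandwidth interconnect between boards would matter for exactly that kind of exchange.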

Full Text