Abstract

Convolutional Neural Networks (CNNs) are widely used to solve complex problems in various fields, such as image recognition, image classification, and video analysis. Convolutional (CONV) layers are the most computationally intensive part of CNN inference, and various architectures have been proposed to process them efficiently. Among these, the systolic array consists of a 2D array of processing elements, which handles GEneral Matrix Multiplication (GEMM) with high efficiency. However, processing a CONV layer as a GEMM requires image-to-column (im2col) processing, also called lowering, for each layer, which necessitates a larger on-chip memory and a considerable amount of repetitive on-chip memory access. In this letter, we propose a systolic array+ (SysAr+) structure augmented with a chaining buffer and a row-streaming dataflow that maximizes data reuse in the CONV layer without the im2col pre-processing and the repetitive accesses to a large on-chip memory. By applying the proposed method to 3×3 CONV layers, we reduce the energy consumption by up to 19.7 percent in ResNet and 37.4 percent in DenseNet with an area overhead of 1.54 percent in SysAr+, and we improve the performance by up to 32.4 percent in ResNet and 12.1 percent in DenseNet.
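As a concrete illustration of the lowering step that the proposed dataflow avoids, the sketch below shows a minimal im2col transform in NumPy that turns a CONV layer into a single GEMM. It is not the letter's hardware dataflow; the shapes, the 3×3 kernel, and the variable names (C, H, W, KH, KW) are illustrative assumptions.

```python
# Minimal im2col (lowering) sketch -- illustrative only, not the letter's design.
import numpy as np

def im2col(x, KH, KW, stride=1):
    """Unfold a C x H x W input into a (C*KH*KW) x (OH*OW) matrix so that
    a CONV layer reduces to one GEMM with the flattened filters."""
    C, H, W = x.shape
    OH = (H - KH) // stride + 1
    OW = (W - KW) // stride + 1
    cols = np.empty((C * KH * KW, OH * OW), dtype=x.dtype)
    for oh in range(OH):
        for ow in range(OW):
            patch = x[:, oh*stride:oh*stride+KH, ow*stride:ow*stride+KW]
            cols[:, oh * OW + ow] = patch.ravel()
    return cols

# Each input element is copied up to KH*KW times into the lowered matrix,
# which is the on-chip memory blow-up discussed in the abstract.
x = np.random.rand(64, 56, 56).astype(np.float32)    # hypothetical 3x3 CONV input
w = np.random.rand(128, 64, 3, 3).astype(np.float32)  # hypothetical filters
out = w.reshape(128, -1) @ im2col(x, 3, 3)             # GEMM producing 128 x (54*54)
```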

Highlights

  • Various CNN accelerators [1], [5], [8] have been proposed, as Convolutional Neural Networks (CNNs) are widely used for understanding visual information

  • We synthesized logic components and registers such as multipliers, MACs, N-tap Shift register Buffers (NTSBs), and Row Buffers (RBs) in a 40 nm technology and evaluated the SRAM components, such as the unified buffer (UB) and accumulators (ACC), using CACTI [10]

  • We used the layers of ResNet-50 and DenseNet-121, in particular the 3×3 CONV layers, as the target layers


Summary

INTRODUCTION

Various CNN accelerators [1], [5], [8] have been proposed, as Convolutional Neural Networks (CNNs) are widely used for understanding visual information. The SysAr structure can be broadly divided into three parts: a 2D x×y processing element (PE) array, a unified buffer (UB), and accumulators (ACC). The UB stores inputs and outputs for data reuse. A CONV layer consists of repetitive MAC operations and offers a high degree of potential data reuse. When a CONV layer is processed in a SysAr, either the same data must be accessed repeatedly from the on-chip memory, or a larger on-chip memory is required to duplicate the input data by a factor of KW×KH. Both approaches increase the energy consumed in the on-chip memory.
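For reference, a minimal direct-convolution loop nest (NumPy) makes this reuse pattern explicit: each input element is consumed by up to KW×KH output positions, so a plain SysAr must either re-fetch it that many times from the UB or hold KW×KH lowered copies. This is a sketch under assumed shapes and names, not code from the letter.

```python
# Direct CONV loop nest sketch, showing why each input element is touched
# up to KH*KW times; without a chaining/row buffer, every touch becomes a
# separate on-chip memory access or an im2col copy. Shapes are assumptions.
import numpy as np

def conv_direct(x, w, stride=1):
    C, H, W = x.shape           # input channels, height, width
    M, _, KH, KW = w.shape      # output channels, kernel height, kernel width
    OH = (H - KH) // stride + 1
    OW = (W - KW) // stride + 1
    y = np.zeros((M, OH, OW), dtype=x.dtype)
    for m in range(M):
        for oh in range(OH):
            for ow in range(OW):
                for kh in range(KH):        # the same x[:, i, j] is read by
                    for kw in range(KW):    # up to KH*KW (oh, ow) positions
                        y[m, oh, ow] += np.dot(
                            w[m, :, kh, kw],
                            x[:, oh*stride + kh, ow*stride + kw])
    return y
```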

ENERGY-EFFICIENCY CHALLENGES OF PROCESSING CONV LAYERS IN A SYSAR
EVALUATION
CONCLUSION