Abstract

Applying convolutional neural networks (CNNs) to high-resolution images produces very large intermediate feature maps (FMs), which dominate the memory traffic. Processing in the classical layer-by-layer order requires storing each complete FM when moving from one layer to the next. Since the size of these FMs realistically only allows storing them in off-chip memory, this leads to high off-chip bandwidth, which comes at a great energy cost. The DepFiN processor chip, presented in this article, overcomes this cost by running CNNs in a deep layer-fusion mode, dubbed depth-first execution, made possible by a control flow that supports frequently switching between layers. To also tackle the computational cost, the computationally efficient depthwise + pointwise (DW + PW) layer pairs are explicitly supported in DepFiN by a novel accelerator core that can dynamically change its configuration to manage the low computational intensity of the depthwise layers. Benchmarking measurements show the 12-nm DepFiN chip reaching up to 20 TOPS/W peak, 8.2 TOPS/W on the MC-CNN-fast stereo-matching network excluding input-output (IO) power (at 8 bit, 0.6-V Vdd), and, crucially, 3.95 TOPS/W with IO power included on the same network, an up to 18× improvement realized by supporting depth-first execution (MC-CNN-fast at 8 bit, 0.65-V Vdd).
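To make the depth-first idea concrete, the sketch below fuses two "same"-padded 3×3 layers and produces the output in small stripes of rows, switching between the layers per stripe so that only a few rows of the intermediate FM are ever alive at once; the intermediate footprint drops from h×w to (stripe + 4)×w, which is what makes keeping intermediates on chip feasible. This is a minimal illustration under assumed choices (a recompute-halo scheme, a stripe of four rows, a box filter standing in for a trained convolution), not DepFiN's actual dataflow; real line-buffered designs typically cache the overlapping rows rather than recompute them.

```python
# Depth-first (fused, stripe-based) execution versus the classical
# layer-by-layer schedule, for two "same"-padded 3x3 layers.
import numpy as np

def conv3x3(x):
    """3x3 box filter with zero 'same' padding, standing in for a conv layer."""
    p = np.pad(x, 1)
    h, w = x.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def layer_by_layer(img):
    fm1 = conv3x3(img)   # the full intermediate FM must be stored here,
    return conv3x3(fm1)  # which for high-resolution images means off-chip traffic

def depth_first(img, stripe=4):
    """Produce the output in stripes of rows, switching between the two
    layers per stripe so only a (stripe + 4)-row tile is ever live."""
    h, w = img.shape
    out = np.empty_like(img)
    for r in range(0, h, stripe):
        rows = min(stripe, h - r)
        # Output rows [r, r+rows) need fm1 rows [r-1, r+rows+1), which in
        # turn need input rows [r-2, r+rows+2): a small halo is recomputed
        # instead of storing the whole intermediate FM.
        tile = np.zeros((rows + 4, w))   # zero rows emulate image-edge padding
        lo, hi = max(0, r - 2), min(h, r + rows + 2)
        tile[lo - (r - 2):hi - (r - 2)] = img[lo:hi]
        fm1 = conv3x3(tile)              # tile-sized intermediate only
        # fm1 rows that fall outside the image act as the second layer's
        # zero padding, exactly as in the global computation.
        if r == 0:
            fm1[:2] = 0.0
        if r + rows == h:
            fm1[rows + 2:] = 0.0
        out[r:r + rows] = conv3x3(fm1)[2:2 + rows]
    return out

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 16))
assert np.allclose(layer_by_layer(img), depth_first(img))
```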
