Abstract

Massively parallel systolic arrays and resource-efficient depthwise-separable convolutions are two promising hardware and software techniques to accelerate DNN inference on the edge. Interestingly, their combination is inefficient: the computational patterns of depthwise-separable convolutions do not exhibit a rhythmic systolic flow and lack sufficient data reuse to saturate systolic arrays. In this article, we formally analyze this inefficiency and propose an efficient operator, an optimal hardware dataflow, and a superior training methodology to alleviate it. The efficient operator, called Fully-Separable Convolutions (FuSeConv), is a drop-in replacement for depthwise-separable convolutions. FuSeConv generalizes the factorization of convolutions fully along their spatial and depth dimensions. The resultant computation is systolic and maps efficiently to systolic arrays. The optimal hardware dataflow, called Spatial-Tiled Output Stationary (ST-OS), maximizes the efficiency of FuSeConv on systolic arrays. It maps independent convolutions to rows of the systolic array to maximize resource utilization with negligible VLSI overheads. Neural Operator Scaffolding (NOS) scaffolds the training of FuSeConv operators by distilling knowledge from the more expensive depthwise-separable convolution operation. This bridges the accuracy gap between FuSeConv networks and networks with depthwise-separable convolutions. Additionally, NOS can be combined with Neural Architecture Search (NAS) to trade off latency and accuracy. The hardware-software co-design of FuSeConv with ST-OS achieves a significant speedup of 4.1-9.25× with state-of-the-art efficient networks on the ImageNet dataset. The parameter efficiency of FuSeConv and its significant superiority over depthwise-separable convolutions on systolic arrays illustrate its promise as a strong solution on the edge. Training FuSeConv networks with NOS achieves accuracy comparable to the depthwise-separable convolution baselines. Further, by combining NOS with NAS, we design networks that define the state of the art, improving on both accuracy and latency for computer vision on systolic arrays.
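For concreteness, the PyTorch snippet below contrasts a standard depthwise-separable block with a FuSeConv-style block that factorizes the spatial depthwise convolution into 1-D (1×K and K×1) depthwise filters. This is a minimal illustrative sketch, not the paper's reference implementation; in particular, the equal channel split between row and column filters and all module names are assumptions made for illustration.

```python
# Illustrative sketch (assumptions noted above), not the authors' reference code.
import torch
import torch.nn as nn

class DepthwiseSeparable(nn.Module):
    """K x K depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, padding=k // 2,
                                   groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class FuSeConvSketch(nn.Module):
    """Fully-separable sketch: half the channels see a 1xK depthwise filter,
    the other half a Kx1 filter; a 1x1 pointwise convolution then mixes depth."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        assert in_ch % 2 == 0
        half = in_ch // 2
        self.row = nn.Conv2d(half, half, (1, k), padding=(0, k // 2),
                             groups=half, bias=False)
        self.col = nn.Conv2d(half, half, (k, 1), padding=(k // 2, 0),
                             groups=half, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)

    def forward(self, x):
        a, b = torch.chunk(x, 2, dim=1)
        return self.pointwise(torch.cat([self.row(a), self.col(b)], dim=1))

# Both blocks map a (N, 32, 56, 56) tensor to (N, 64, 56, 56); the 1-D depthwise
# convolutions of the FuSeConv-style block are the independent row/column
# convolutions that a dataflow such as ST-OS can map to rows of a systolic array.
x = torch.randn(1, 32, 56, 56)
print(DepthwiseSeparable(32, 64)(x).shape, FuSeConvSketch(32, 64)(x).shape)
```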
