Abstract

We present a 40-nm multi-scale object detection processor that supports only three operations: $3\times 3$ convolution, $1\times 1$ convolution, and $4\times 4$ deconvolution. The deconvolution feature enables multi-scale object detection at high accuracy. The input feature-map memory is 8 bits wide, and the input multipliers have 8-bit precision. The partial-sum memory, however, is 16 bits wide to suppress the deterioration of detection accuracy in layers with 512 or more channels. Fixed-point precision reduces both the external memory bandwidth and the internal memory capacity. Optimized parallelization across input and output channels reduces the external memory bandwidth to 0.50 GB per $1280\times 384$ image with an internal memory capacity of 400 kB. The detection error is 1.9% of that using single-precision floating point. The maximum operating frequency is 500 MHz at a supply voltage of 1 V. Its peak performance is 1.15 TOPS. The maximum energy efficiency is 6.57 TOPS/W at 174 MHz and 0.6 V.
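
The headline figures can be related with some back-of-the-envelope arithmetic. The sketch below is illustrative only and rests on assumptions the abstract does not state: that the 1.15 TOPS peak is reached at the 500 MHz maximum frequency, that throughput scales linearly with clock frequency, and that one multiply-accumulate counts as two operations. The function names are hypothetical.

```python
# Rough arithmetic relating the abstract's headline figures.
# ASSUMPTIONS (not stated in the abstract): the peak is reached at 500 MHz,
# throughput scales linearly with clock, and 1 MAC = 2 operations.

PEAK_TOPS = 1.15          # peak performance (from the abstract)
F_MAX_HZ = 500e6          # maximum operating frequency at 1 V
F_EFF_HZ = 174e6          # frequency at the best-efficiency point (0.6 V)
EFF_TOPS_PER_W = 6.57     # maximum energy efficiency


def ops_per_cycle(peak_tops: float, freq_hz: float) -> float:
    """Operations completed per clock cycle at the given frequency."""
    return peak_tops * 1e12 / freq_hz


def scaled_tops(peak_tops: float, f_max_hz: float, f_hz: float) -> float:
    """Throughput at f_hz, assuming linear scaling with clock frequency."""
    return peak_tops * f_hz / f_max_hz


def power_watts(tops: float, tops_per_watt: float) -> float:
    """Power implied by a throughput and an energy-efficiency figure."""
    return tops / tops_per_watt


if __name__ == "__main__":
    opc = ops_per_cycle(PEAK_TOPS, F_MAX_HZ)
    tops_low = scaled_tops(PEAK_TOPS, F_MAX_HZ, F_EFF_HZ)
    watts = power_watts(tops_low, EFF_TOPS_PER_W)
    print(f"ops per cycle: {opc:.0f} (about {opc / 2:.0f} MACs if 1 MAC = 2 ops)")
    print(f"estimated throughput at 174 MHz: {tops_low:.3f} TOPS")
    print(f"implied power at 174 MHz, 0.6 V: {watts * 1e3:.1f} mW")
```

Under these assumptions the processor would perform about 2300 operations per cycle and draw roughly 61 mW at the best-efficiency point. These are derived estimates, not figures reported by the authors.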
