Abstract

In recent years, Convolutional Neural Networks (CNNs) have evolved rapidly, and real-time CNN implementations in embedded systems are in high demand. High-performance, real-time CNN inference must therefore be realized on local processors. Conventional approaches to designing CNN accelerators focus on reducing the computational workload of CNNs. However, limited external memory bandwidth has become the main bottleneck of CNN acceleration in embedded systems: in deep and large CNN models, the feature-map pixels and weights are too numerous to fit on chip and must reside in external memory, so they are exchanged between off-chip and on-chip memories frequently, and performance is constrained by the available external memory bandwidth. In this paper, bandwidth-efficient architectures for CNN implementation are proposed. The intermediate pixel data are stored on chip, and the kernel weights are transferred in an efficient way. Compared to mainstream CNN implementation methods, the proposed architectures utilize external memory bandwidth efficiently while preserving the original throughput.
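To illustrate the data-movement idea behind the abstract, the following is a minimal C sketch of keeping intermediate feature maps in on-chip buffers while streaming only kernel weights from external memory. All dimensions, buffer names, and the ping-pong scheme are illustrative assumptions for a toy direct convolution, not the architecture actually proposed in the paper.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical toy dimensions, chosen only for illustration. */
#define H 16      /* feature-map height   */
#define W 16      /* feature-map width    */
#define C 8       /* channels per layer   */
#define K 3       /* kernel size          */
#define LAYERS 4

/* "On-chip" buffers: intermediate feature maps never leave these arrays,
   emulating on-chip pixel storage. Two buffers are ping-ponged between
   layers so each layer's output becomes the next layer's input. */
static float buf_a[C][H][W];
static float buf_b[C][H][W];

/* "External memory": only the kernel weights are fetched from here, once
   per layer, so per-layer off-chip traffic is limited to weights. */
static float dram_weights[LAYERS][C][C][K][K];

/* Direct 3x3 same-padding convolution followed by ReLU. */
static void conv_layer(float in[C][H][W], float out[C][H][W],
                       float w[C][C][K][K])
{
    for (int oc = 0; oc < C; oc++)
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++) {
                float acc = 0.0f;
                for (int ic = 0; ic < C; ic++)
                    for (int ky = 0; ky < K; ky++)
                        for (int kx = 0; kx < K; kx++) {
                            int iy = y + ky - K / 2;
                            int ix = x + kx - K / 2;
                            if (iy >= 0 && iy < H && ix >= 0 && ix < W)
                                acc += in[ic][iy][ix] * w[oc][ic][ky][kx];
                        }
                out[oc][y][x] = acc > 0.0f ? acc : 0.0f;
            }
}

int main(void)
{
    float (*in)[H][W]  = buf_a;
    float (*out)[H][W] = buf_b;

    /* Load the input image into the on-chip buffer once (a unit impulse
       here, purely as placeholder data). */
    memset(buf_a, 0, sizeof buf_a);
    buf_a[0][H / 2][W / 2] = 1.0f;

    for (int l = 0; l < LAYERS; l++) {
        /* Only weights cross the off-chip boundary for this layer:
           C*C*K*K floats instead of a full intermediate feature map. */
        conv_layer(in, out, dram_weights[l]);
        float (*tmp)[H][W] = in; in = out; out = tmp; /* ping-pong */
    }

    printf("weight bytes moved off-chip per layer: %zu\n",
           sizeof dram_weights[0]);
    printf("feature-map bytes kept on chip:        %zu\n", sizeof buf_a);
    return 0;
}
```

In this sketch the per-layer off-chip traffic is the weight volume alone, while the repeated producer/consumer traffic of intermediate pixels stays inside the two static buffers; this is the trade-off the abstract describes, reducing external bandwidth pressure without changing the computation performed.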
