Abstract

Convolutional neural networks (CNNs) are now widely used in common tasks such as image classification, semantic segmentation, and face recognition. Convolution layers are the core layers of a CNN: their computing speed directly affects the computing speed of the entire network, and therefore its real-time performance. The usual way to accelerate a convolutional layer is to use the image-to-column (im2col) algorithm to expand the input image into a column matrix, and then use general matrix multiplication (GEMM) to multiply that matrix by the convolution kernel. This greatly improves the computing speed of the convolutional layer, because most computing platforms have mature GEMM optimizations. However, DSPs are already very fast at vector multiplication and addition, so during convolutional-layer inference the memory accesses of the im2col algorithm consume far more time than the GEMM itself; this has become the bottleneck for further speed optimization. In this article, I present an acceleration method for the im2col algorithm in the stride-1 case, based on reading contiguous memory addresses. With this method, the im2col algorithm can be made more than 10 times faster on stride-1 convolutional layers. The method is portable; I will show its optimization effect on Xtensa BBE64ep DSP cores and STM32F4 processors.
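For context, the sketch below illustrates the kind of contiguous-copy im2col that such an optimization relies on. It is my own illustration under assumed conventions, not the paper's implementation: the function name im2col_stride1, the single-channel "valid" (no padding) layout, and the parameter names H, W, KH, KW are all assumptions. The key point is that with stride 1, the pixels a given kernel offset touches across an output row are contiguous in the input, so one bulk copy can replace many scattered per-element loads.

#include <string.h>

/* Minimal im2col sketch: single-channel input, stride 1, no padding.
 * Output layout: cols is a (KH*KW) x (OH*OW) row-major matrix, one row
 * per kernel offset (kh, kw), one column per output pixel (oh, ow).
 */
static void im2col_stride1(const float *in, float *cols,
                           int H, int W, int KH, int KW)
{
    const int OH = H - KH + 1;  /* output height for stride 1, valid mode */
    const int OW = W - KW + 1;  /* output width                           */

    for (int kh = 0; kh < KH; kh++) {
        for (int kw = 0; kw < KW; kw++) {
            /* Row of the column matrix for this kernel offset. */
            float *dst = cols + (kh * KW + kw) * (OH * OW);
            for (int oh = 0; oh < OH; oh++) {
                /* With stride 1, the OW input pixels needed for this
                 * (kh, kw, oh) are contiguous in memory, so a single
                 * bulk copy replaces OW scattered element loads.      */
                const float *src = in + (oh + kh) * W + kw;
                memcpy(dst + oh * OW, src, OW * sizeof(float));
            }
        }
    }
}

A GEMM between this (KH*KW)-row column matrix and the flattened convolution kernel then produces the convolution output directly.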
