Abstract

Convolution is a primary operation in convolutional neural networks, and inference speed is largely determined by the speed of the convolutional layers. Improvements in the performance of embedded processors make it feasible to run inference directly on embedded devices. In this article, a pipelining strategy for single instruction, multiple data (SIMD) instructions is proposed to finely optimize the 3 × 3 convolution on ARM-based CPUs. We implement a SIMD group to improve the efficiency of the SIMD pipeline, and a tiling method is exploited to increase data reuse during the computation. An evaluation model is proposed to guide the design of the tiling method and the register allocation. Our implementation runs 5.18 times faster than the unoptimized version compiled with the GNU Compiler Collection (GCC) on the RK3288. The effect of our optimization method is measured with a performance profiling tool; the profiling results indicate that the pipelining strategy is effective for both normal and depthwise separable convolutions. With multithreaded processing, the speedup reaches 18.3 over the single-threaded unoptimized version.
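To make the setting concrete, the following is a minimal illustrative sketch (not the paper's actual implementation) of how one row of taps of a 3 × 3 convolution can be vectorized with ARM NEON intrinsics in C; the function name conv3x3_row_neon and its parameters are hypothetical, and float32 data with an output width that is a multiple of 4 is assumed. Overlapping loads of the same input row across the three horizontal taps hint at the data-reuse and pipelining ideas the abstract describes.

    #include <arm_neon.h>

    /* Illustrative sketch, not the authors' kernel: accumulate one row of
     * 3x3 filter taps into four adjacent outputs at a time using NEON.
     * Calling this three times (once per kernel row, with the matching
     * input row) on a zero-initialized output produces one output row. */
    static void conv3x3_row_neon(const float *in,   /* input row, width >= w + 2 */
                                 const float *kern, /* the 3 taps of this kernel row */
                                 float *out,        /* accumulator row, width w */
                                 int w)             /* output width, multiple of 4 */
    {
        float32x4_t k0 = vdupq_n_f32(kern[0]);
        float32x4_t k1 = vdupq_n_f32(kern[1]);
        float32x4_t k2 = vdupq_n_f32(kern[2]);

        for (int x = 0; x < w; x += 4) {
            float32x4_t acc = vld1q_f32(out + x);
            /* Three overlapping loads cover the three horizontal taps,
             * so each input element is reused by neighboring outputs. */
            float32x4_t i0 = vld1q_f32(in + x);
            float32x4_t i1 = vld1q_f32(in + x + 1);
            float32x4_t i2 = vld1q_f32(in + x + 2);
            acc = vmlaq_f32(acc, i0, k0); /* acc += i0 * k0 */
            acc = vmlaq_f32(acc, i1, k1);
            acc = vmlaq_f32(acc, i2, k2);
            vst1q_f32(out + x, acc);
        }
    }

The paper's contribution lies in how such multiply-accumulate sequences are scheduled (the SIMD pipeline and SIMD group) and how tiles are sized so that inputs, weights, and accumulators stay resident in registers; the sketch above shows only the baseline vectorization those techniques build on.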