Abstract

In this paper, a new method for accelerating the 2D direct convolution operation on x86/x64 processors is presented. It combines efficient vectorization using SIMD intrinsics, bit-twiddling optimizations, an optimized division operation, multi-threading with OpenMP, register blocking, and the shortest possible bit-width for the intermediate results. The proposed method, which is provided as open source, is general and can also be applied to other processor families, e.g., Arm. It has been evaluated on two different multi-core Intel CPUs, using twenty different image sizes, 8-bit integer computations, and the most commonly used kernel sizes (3×3, 5×5, 7×7, 9×9). It achieves from 2.8× to 40× speedup over the Intel IPP library (OpenCV GaussianBlur and Filter2D routines), from 105× to 400× speedup over the GEMM-based convolution method (using the Intel MKL int8 matrix-multiplication routine), and from 8.5× to 618× speedup over the vslsConvExec Intel MKL direct convolution routine. The proposed method is superior because it executes far fewer arithmetic and load/store instructions.
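To give a flavor of the optimizations the abstract names, the following is a minimal C/AVX2 sketch of one output row of a small (e.g., 3×3) 8-bit convolution: pixels are widened only to 16-bit lanes (the shortest intermediate width that cannot overflow for small kernels), and the final division by the constant kernel-weight sum is replaced by a multiply-high and shift. The function name and the `recip`/`shift` parameters are hypothetical illustrations, not the paper's actual code.

```c
#include <stdint.h>
#include <immintrin.h>

/* Sketch: one output row of a 3x3 convolution on unsigned 8-bit pixels.
 * Assumes 16-bit accumulators are wide enough (true for small kernels
 * with modest weight sums, e.g., a 3x3 Gaussian with sum 16). */
void conv3x3_row_sketch(const uint8_t *in, uint8_t *out,
                        int width, int stride,
                        const int16_t k[9], uint16_t recip, int shift)
{
    for (int x = 0; x <= width - 16; x += 16) {
        __m256i acc = _mm256_setzero_si256();
        for (int ky = 0; ky < 3; ky++) {
            for (int kx = 0; kx < 3; kx++) {
                /* Widen 16 unsigned 8-bit pixels to 16-bit lanes. */
                __m256i p = _mm256_cvtepu8_epi16(
                    _mm_loadu_si128((const __m128i *)
                        (in + ky * stride + x + kx)));
                __m256i w = _mm256_set1_epi16(k[ky * 3 + kx]);
                acc = _mm256_add_epi16(acc, _mm256_mullo_epi16(p, w));
            }
        }
        /* Divide by the constant kernel-weight sum without a DIV:
         * take the high half of a 16x16 multiply by a precomputed
         * fixed-point reciprocal, then shift right. */
        acc = _mm256_srli_epi16(
                  _mm256_mulhi_epu16(acc, _mm256_set1_epi16(recip)),
                  shift);
        /* Narrow back to 8 bits with unsigned saturation. */
        __m128i lo = _mm256_castsi256_si128(acc);
        __m128i hi = _mm256_extracti128_si256(acc, 1);
        _mm_storeu_si128((__m128i *)(out + x), _mm_packus_epi16(lo, hi));
    }
}
```

The remaining ingredients from the abstract would sit around this loop: the outer loop over image rows can be parallelized with an OpenMP `#pragma omp parallel for`, and register blocking would keep several independent accumulators live per iteration to hide instruction latencies.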
