Analysis and Optimization of Direct Convolution Execution on Multi-Core Processors

Mirco Mannino, Andrea Mondelli, Biagio Peccerillo, Sandro Bartolini

doi:10.1109/access.2023.3283312

Open Access

https://doi.org/10.1109/access.2023.3283312

Abstract

Nowadays, convolutional neural networks are among the most widely used types of deep learning networks thanks to their usefulness in many application domains. Many efforts aim to improve their training and inference performance and efficiency. One of the most widely used techniques to implement convolution consists of flattening tensors into 2D matrices and carrying out the operation through a matrix-matrix multiplication routine, which has highly optimized implementations in high-performance libraries. However, this kind of approach uses extra time and memory to transform and store the tensors involved. For this reason, direct convolution is becoming increasingly popular. Direct convolution can be implemented as a series of nested loops iterating over tensor dimensions, and it does not require extra memory. In this work, we evaluate on various multi-core CPUs the performance and scalability effects deriving from different parallelization strategies, loop organizations, and SIMD-vectorization approaches with different compilers in relation to architectural aspects. We discuss each parameter thoroughly and distill our findings into a set of heuristics that can be used to quickly achieve a high-performance implementation in accordance with the underlying hardware and the characteristics of the convolutional layer at hand. By adopting a per-layer approach, we increase performance by up to 60-70% compared to a single static implementation used for all the layers. Moreover, our results are comparable to, or even better than (up to 1.67× speedup), matrix-matrix multiplication-based convolution in a multi-core system.
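To make the contrast in the abstract concrete, the following is a minimal sketch of the two approaches, not the authors' implementation. It assumes stride 1, no padding, a single image, and plain nested lists; the function names `direct_conv2d` and `im2col_conv2d` are hypothetical and chosen only for illustration. As is usual for CNN layers, the operation shown is technically a cross-correlation (the kernel is not flipped).

```python
def _shapes(inp, weights):
    # inp: [C_in][H][W], weights: [C_out][C_in][KH][KW] (assumed layouts)
    c_in, h, w = len(inp), len(inp[0]), len(inp[0][0])
    c_out, kh, kw = len(weights), len(weights[0][0]), len(weights[0][0][0])
    return c_in, c_out, kh, kw, h - kh + 1, w - kw + 1


def direct_conv2d(inp, weights):
    """Direct convolution: nested loops over the tensor dimensions."""
    c_in, c_out, kh, kw, h_out, w_out = _shapes(inp, weights)
    out = [[[0.0] * w_out for _ in range(h_out)] for _ in range(c_out)]
    for co in range(c_out):              # output channels
        for y in range(h_out):           # output rows
            for x in range(w_out):       # output columns
                acc = 0.0
                for ci in range(c_in):   # input channels
                    for ky in range(kh):
                        for kx in range(kw):
                            acc += inp[ci][y + ky][x + kx] * weights[co][ci][ky][kx]
                out[co][y][x] = acc
    return out


def im2col_conv2d(inp, weights):
    """Flatten input patches into a 2D matrix, then do a matrix-matrix product.

    Returns the output and the number of elements in the extra patch buffer.
    """
    c_in, c_out, kh, kw, h_out, w_out = _shapes(inp, weights)
    # Extra buffer: one flattened patch of C_in*KH*KW values per output pixel.
    cols = [
        [inp[ci][y + ky][x + kx]
         for ci in range(c_in) for ky in range(kh) for kx in range(kw)]
        for y in range(h_out) for x in range(w_out)
    ]
    # Each filter flattened into one row, in the same element order as a patch.
    rows = [
        [weights[co][ci][ky][kx]
         for ci in range(c_in) for ky in range(kh) for kx in range(kw)]
        for co in range(c_out)
    ]
    # Matrix-matrix multiplication: (C_out x K) times (K x H_out*W_out).
    flat = [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in rows]
    # Reshape each flat output row back to H_out x W_out.
    out = [[f[y * w_out:(y + 1) * w_out] for y in range(h_out)] for f in flat]
    return out, len(cols) * len(cols[0])
```

Both functions compute the same output, but the second one first materializes a patch buffer whose size grows with the kernel area times the number of output pixels, which is the extra memory the abstract refers to. In the direct version, the order of the six loops and the choice of which loops to parallelize or vectorize are free parameters; the sketch fixes one arbitrary order and makes no attempt at either.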

Journal: IEEE Access
Publication Date: Jan 1, 2023
License type: CC BY-NC-ND 4.0

Similar Papers

Accelerating the "Motifs" in Machine Learning on Modern Processors
27 May 2021

I/O lower bounds for auto-tuning of convolutions in CNNs
Xiaoyang Zhang ... Guangming Tan
17 Feb 2021

Automatic generation of specialized direct convolutions for mobile GPUs
Naums Mogers ... Valentin Radu
23 Feb 2020

A CNN Inference micro-benchmark for Performance Analysis and Optimization on GPUs
Jurn-Gyu Park ... Zhumakhan Nazir
09 Oct 2022
