Abstract

The general matrix–matrix multiplication is a core building block for implementing Basic Linear Algebra Subprograms. This paper presents a methodology for automatically producing the matrix–matrix multiplication kernels tuned for the Intel Xeon Phi Processor code-named Knights Landing and the Intel Skylake-SP processors with AVX-512 intrinsic functions. The architecture of the latest manycore processors has been complicated in the levels of parallelism and cache hierarchies; it is not easy to find the best combination of optimization techniques for a given application. Our approach produces matrix multiplication kernels through a process of heuristic auto-tuning based on generating multiple kernels and selecting the fastest ones through performance tests. The tuning parameters include the size of block matrices for registers and caches, prefetch distances, and loop unrolling depth. Parameters for multithreaded execution, such as identifying loops to parallelize and the optimal number of threads for such loops are also investigated. We also present a method to reduce the parameter search space based on our previous research results.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.