Abstract

SIMD acceleration can potentially boost application throughput by factors. However, achieving efficient SIMD vectorization for scalar code with complex data flow and branching logic goes well beyond breaking a few loop dependencies and relying on the compiler. Since the refactoring effort scales with the number of lines of code, it is important to understand what kind of performance gains can be expected in such complex cases. A couple of years ago we started to investigate a top-to-bottom vectorization approach to particle transport simulation. Percolating vector data down to the algorithms was mandatory, since not all components can vectorize internally. Vectorizing low-level algorithms is certainly necessary, but not sufficient to achieve relevant SIMD gains; in addition, the overheads of maintaining the concurrent vector data flow and of copying data have to be minimized. In the context of a vectorization R&D for simulation, we developed a framework that allows different categories of scalar and vectorized components to co-exist, handling data flow management and real-time heuristic optimizations. This paper describes our approach to coordinating SIMD vectorization at the framework level, with a detailed quantitative analysis of SIMD gains versus overheads, broken down by component in terms of geometry, physics and magnetic field propagation. We also present the more general context of this R&D work and its goals for 2018.
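To make the track-level data flow idea concrete, the following is a minimal C++ sketch assuming a structure-of-arrays container for a basket of tracks; the names TrackSoA and propagateStep are purely illustrative and do not correspond to the actual framework API.

  #include <cmath>
  #include <cstddef>
  #include <vector>

  // Illustrative structure-of-arrays layout for a basket of tracks:
  // each coordinate lives in its own contiguous array, so a loop over
  // the basket reads unit-stride memory and maps naturally onto SIMD lanes.
  struct TrackSoA {
    std::vector<double> x, y, z;     // positions
    std::vector<double> px, py, pz;  // momentum components
    std::size_t size() const { return x.size(); }
  };

  // Straight-line propagation of every track in the basket by one step.
  // The loop body is branch-free, so an optimizing compiler can vectorize it.
  void propagateStep(TrackSoA& t, double step) {
    for (std::size_t i = 0, n = t.size(); i < n; ++i) {
      const double p = std::sqrt(t.px[i] * t.px[i] + t.py[i] * t.py[i] + t.pz[i] * t.pz[i]);
      t.x[i] += step * t.px[i] / p;
      t.y[i] += step * t.py[i] / p;
      t.z[i] += step * t.pz[i] / p;
    }
  }

The overheads mentioned above come from gathering scalar tracks into such vector containers and scattering results back, which is why they must remain small relative to the vectorized work.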


Summary

Introduction

Due to the physical constraints preventing further frequency scaling, parallel computing has become the dominant paradigm in modern computer architectures. In SIMD, elements of short vectors are processed in parallel using special vector registers and an extended instruction set, while in SIMT, instructions of several threads run in parallel. Both approaches broadcast the same instruction to different execution units; the main differences lie in their degrees of flexibility versus efficiency. While the benefit of SIMD and/or SIMT has been demonstrated for applications featuring massive data parallelism, such as linear algebra or graphics, we are trying to develop vectorization techniques that preserve these benefits for code with high complexity and branching. In the final sections we illustrate this on concrete examples backed by specific measurements.
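As a minimal illustration of why branching matters for SIMD, consider the two loops below. They are not taken from the paper's code; they are a generic sketch, and the function and parameter names are invented for the example.

  #include <cstddef>

  // SIMD-friendly: one instruction stream, no divergent control flow.
  // The branch is expressed as a select, so every element executes the same
  // operations and the ternary can compile to a vector min/blend.
  void clampScale(const float* in, float* out, float limit, float scale, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
      const float v = in[i] * scale;
      out[i] = (v > limit) ? limit : v;
    }
  }

  // Branch-heavy: per-element control flow with a side effect (the counter)
  // often prevents or complicates auto-vectorization, which is where much of
  // the potential SIMD gain is lost in complex code.
  void clampScaleBranchy(const float* in, float* out, float limit, float scale,
                         std::size_t n, std::size_t& overflowCount) {
    for (std::size_t i = 0; i < n; ++i) {
      const float v = in[i] * scale;
      if (v > limit) { out[i] = limit; ++overflowCount; }
      else           { out[i] = v; }
    }
  }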

Vectorizing on track data
Efficiency versus overhead
Benchmarks and ongoing optimizations
Findings
Conclusions