Abstract

SIMD acceleration can potentially boost application throughput by factors. However, achieving efficient SIMD vectorization for scalar code with complex data flow and branching logic goes well beyond breaking a few loop dependencies and relying on the compiler. Since the refactoring effort scales with the number of lines of code, it is important to understand what kind of performance gains can be expected in such complex cases. A couple of years ago we started to investigate a top-to-bottom vectorization approach to particle transport simulation. Percolating vector data down to the algorithms was mandatory, since not all components can vectorize internally. Vectorizing low-level algorithms is certainly necessary, but not sufficient to achieve relevant SIMD gains; in addition, the overheads of maintaining the concurrent vector data flow and of copying data have to be minimized. In the context of a vectorization R&D for simulation, we developed a framework that allows different categories of scalar and vectorized components to co-exist, handling data flow management and real-time heuristic optimizations. This paper describes our approach to coordinating SIMD vectorization at the framework level, with a detailed quantitative analysis of SIMD gains versus overheads, broken down by component in terms of geometry, physics and magnetic field propagation. We also present the more general context of this R&D work and its goals for 2018.
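To make the track-level data flow idea concrete, the following is a minimal C++ sketch assuming a structure-of-arrays container for a basket of tracks; the names TrackSoA and propagateStep are purely illustrative and do not correspond to the actual framework API.

  #include <cmath>
  #include <cstddef>
  #include <vector>

  // Illustrative structure-of-arrays layout for a basket of tracks:
  // each coordinate lives in its own contiguous array, so a loop over
  // the basket reads unit-stride memory and maps naturally onto SIMD lanes.
  struct TrackSoA {
    std::vector<double> x, y, z;     // positions
    std::vector<double> px, py, pz;  // momentum components
    std::size_t size() const { return x.size(); }
  };

  // Straight-line propagation of every track in the basket by one step.
  // The loop body is branch-free, so an optimizing compiler can vectorize it.
  void propagateStep(TrackSoA& t, double step) {
    for (std::size_t i = 0, n = t.size(); i < n; ++i) {
      const double p = std::sqrt(t.px[i] * t.px[i] + t.py[i] * t.py[i] + t.pz[i] * t.pz[i]);
      t.x[i] += step * t.px[i] / p;
      t.y[i] += step * t.py[i] / p;
      t.z[i] += step * t.pz[i] / p;
    }
  }

The overheads mentioned above come from gathering scalar tracks into such vector containers and scattering results back, which is why they must remain small relative to the vectorized work.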


Summary

Introduction

Due to the physical constraints preventing further frequency scaling, parallel computing has become the dominant paradigm in modern computer architectures. In SIMD, elements of short vectors are processed in parallel using special vector registers and an extended instruction set, while in SIMT, instructions of several threads run in parallel. Both approaches broadcast the same instruction to different execution units; the main differences lie in their degrees of flexibility versus efficiency. While the benefit of SIMD and/or SIMT has been demonstrated for applications featuring massive data parallelism, such as linear algebra or graphics, we are trying to develop vectorization techniques that preserve these benefits for code with high complexity and branching. In the final sections we illustrate this on concrete examples backed by specific measurements.
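As a minimal illustration of why branching matters for SIMD, consider the two loops below. They are not taken from the paper's code; they are a generic sketch, and the function and parameter names are invented for the example.

  #include <cstddef>

  // SIMD-friendly: one instruction stream, no divergent control flow.
  // The branch is expressed as a select, so every element executes the same
  // operations and the ternary can compile to a vector min/blend.
  void clampScale(const float* in, float* out, float limit, float scale, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
      const float v = in[i] * scale;
      out[i] = (v > limit) ? limit : v;
    }
  }

  // Branch-heavy: per-element control flow with a side effect (the counter)
  // often prevents or complicates auto-vectorization, which is where much of
  // the potential SIMD gain is lost in complex code.
  void clampScaleBranchy(const float* in, float* out, float limit, float scale,
                         std::size_t n, std::size_t& overflowCount) {
    for (std::size_t i = 0; i < n; ++i) {
      const float v = in[i] * scale;
      if (v > limit) { out[i] = limit; ++overflowCount; }
      else           { out[i] = v; }
    }
  }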

Vectorizing on track data
Efficiency versus overhead
Benchmarks and ongoing optimizations
Findings
Conclusions