Abstract

In the Single-Program Multiple-Data (SPMD) programming model, threads of an application exhibit very similar control flows and often execute the same instructions, but on different data. In this paper, we propose the Dynamic Inter-thread Vectorization Architecture (DITVA) to leverage the implicit data-level parallelism that exists across threads of SPMD applications. By assembling dynamic vector instructions at runtime, DITVA extends an in-order SMT processor with a dynamic inter-thread vector execution mode akin to the Single-Instruction, Multiple-Thread (SIMT) model of Graphics Processing Units. In this mode, multiple scalar threads running in lockstep share a single instruction stream, and their respective instruction instances are aggregated into SIMD instructions. DITVA can leverage existing SIMD units and maintains binary compatibility with existing CPU architectures. To balance thread- and data-level parallelism, threads are statically grouped into fixed-size, independently scheduled warps. Additionally, to maximize dynamic vectorization opportunities, we adapt the fetch steering policy to favor thread synchronization within warps and thus improve lockstep execution. Our experimental evaluation of the DITVA architecture on SPMD applications from the PARSEC and Rodinia OpenMP benchmarks shows that a 4-warp × 4-lane, 4-issue DITVA architecture with a realistic bank-interleaved cache achieves 1.55× higher performance than a 4-thread, 4-issue SMT architecture with AVX instructions, while fetching and issuing 51% fewer instructions and reducing overall energy by 24%. DITVA also enables applications limited by memory to scale with higher-bandwidth architectures: when the bandwidth is increased from 2 GB/s to 16 GB/s, memory-bound applications improve in performance by 3× over the baseline SMT.
Therefore, DITVA appears as a cost-effective design for achieving very high single-core performance on SPMD parallel sections.

Highlights

  • Single-Program Multiple-Data (SPMD) applications express parallelism by creating multiple instruction streams executed by scalar threads running the same program but operating on different data

  • Many studies have focused on optimizing the instruction fetch policy while leaving the instruction core unchanged; other studies have pointed out the ability to benefit from memory-level parallelism through resource-sharing policies

  • SIMT architectures, used in Graphics Processing Units (GPUs) to exploit inter-thread redundancy, can vectorize the execution of multi-threaded applications at warp granularity, but they require a specific instruction set to convey branch divergence and reconvergence information to the hardware
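The SPMD execution pattern described in the highlights can be illustrated with a small sketch (our own toy example, not code from the paper): several threads run the same short program on private data, so at every step they all execute the same instruction, differing only in their operands.

```python
# Hypothetical scalar program shared by all threads: each entry is
# (opcode, immediate operand).
program = [("mul", 2), ("add", 3), ("mul", 5)]

def run_thread(x):
    """Execute the shared program on one thread's private data."""
    for op, imm in program:
        if op == "add":
            x += imm
        elif op == "mul":
            x *= imm
    return x

# Four SPMD "threads" with different private inputs: identical control
# flow and instruction sequence, different data.
thread_data = [1, 2, 3, 4]
results = [run_thread(x) for x in thread_data]
print(results)
```

Because every thread fetches the same instruction at each step, a single fetch could in principle serve all four of them, which is the redundancy DITVA exploits.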


Summary

Introduction

Single-Program Multiple-Data (SPMD) applications express parallelism by creating multiple instruction streams executed by scalar threads running the same program. The implicit data-level parallelism (DLP) that exists across the threads of an SPMD program is neither captured by the programming model – threads execute asynchronously – nor leveraged by current processors. DITVA extends an in-order SMT architecture by dynamically aggregating instruction instances from different threads and steering them to SIMD units. It even supports explicit SIMD instruction sets such as SSE and AVX efficiently on the same physical execution units, allowing programmers and compilers to freely combine explicitly-vectorized SIMD code and implicitly-vectorized SPMD code.
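The aggregation idea can be sketched in a toy functional model (our own illustration, with a hypothetical three-instruction ISA; it is not the paper's microarchitecture): within a warp, threads sitting at the same program counter share one instruction fetch, and their scalar instances are merged into a single dynamic vector (DV) instruction, while a min-PC heuristic tends to bring diverged threads back into lockstep.

```python
# Hypothetical ISA: ("addi", imm) increments, ("blt", bound, target)
# branches if x < bound, ("halt",) stops the thread.
PROGRAM = [
    ("addi", 1),     # 0: x += 1
    ("blt", 4, 0),   # 1: if x < 4 goto 0  (trip count varies per thread)
    ("halt",),       # 2:
]

def run_warp(xs):
    """Run one warp of threads; return final values and fetch count."""
    pc = [0] * len(xs)          # per-thread program counters
    fetches = 0
    while any(PROGRAM[p][0] != "halt" for p in pc):
        # Min-PC heuristic: schedule the earliest PC among non-halted
        # threads, which favors reconvergence into lockstep.
        active_pc = min(p for p in pc if PROGRAM[p][0] != "halt")
        mask = [i for i, p in enumerate(pc) if p == active_pc]
        fetches += 1            # one fetch serves every thread in `mask`
        op = PROGRAM[active_pc]
        for i in mask:          # one DV-instruction: same op, many lanes
            if op[0] == "addi":
                xs[i] += op[1]
                pc[i] += 1
            elif op[0] == "blt":
                pc[i] = op[2] if xs[i] < op[1] else active_pc + 1
    return xs, fetches
```

For inputs `[0, 2]`, the two threads iterate the loop a different number of times, so they diverge near the end; running them scalar would cost 12 fetches in this model, while the grouped execution needs only 8, illustrating the fetch savings the paper reports.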

Motivation
Instruction redundancy across SPMD threads
Related work
The SIMT execution model
Instruction redundancy in SMT
Thread reconvergence for SPMD applications
The Dynamic Inter-Thread Vectorization Architecture
Data memory accesses
Maintaining lockstep execution
DV-instructions per cycle
Experimental Framework
Throughput
Divergence and mispredictions
Impact of split data TLB
L1 cache bank conflict reduction
Quantitative evaluation
Conclusion
