Abstract

In the Single-Program Multiple-Data (SPMD) programming model, threads of an application exhibit very similar control flows and often execute the same instructions, but on different data. In this paper, we propose the Dynamic Inter-thread Vectorization Architecture (DITVA) to leverage the implicit data-level parallelism that exists across threads of SPMD applications. By assembling dynamic vector instructions at runtime, DITVA extends an in-order SMT processor with a dynamic inter-thread vector execution mode akin to the Single-Instruction, Multiple-Thread (SIMT) model of Graphics Processing Units. In this mode, multiple scalar threads running in lockstep share a single instruction stream, and their respective instruction instances are aggregated into SIMD instructions. DITVA can leverage existing SIMD units and maintains binary compatibility with existing CPU architectures. To balance thread- and data-level parallelism, threads are statically grouped into fixed-size, independently scheduled warps. Additionally, to maximize dynamic vectorization opportunities, we adapt the fetch steering policy to favor thread synchronization within warps and thus improve lockstep execution. Our experimental evaluation of the DITVA architecture on SPMD applications from the PARSEC and Rodinia OpenMP benchmarks shows that a 4-warp × 4-lane, 4-issue DITVA architecture with a realistic bank-interleaved cache achieves 1.55× higher performance than a 4-thread, 4-issue SMT architecture with AVX instructions, while fetching and issuing 51% fewer instructions and reducing overall energy by 24%. DITVA also enables applications limited by memory to scale with higher-bandwidth architectures: when the bandwidth is increased from 2 GB/s to 16 GB/s, memory-bound applications improve in performance by 3× over the baseline SMT.
Therefore, DITVA appears as a cost-effective design for achieving very high single-core performance on SPMD parallel sections.

Highlights

  • Single-Program Multiple-Data (SPMD) applications express parallelism by creating multiple instruction streams executed by scalar threads running the same program but operating on different data

  • Many studies have focused on optimizing the instruction fetch policy while leaving the instruction core unchanged; other studies have pointed out the ability to benefit from memory-level parallelism through resource-sharing policies

  • SIMT architectures, used in Graphics Processing Units (GPUs) to exploit inter-thread redundancy, can vectorize the execution of multi-threaded applications at warp granularity, but they require a specific instruction set to convey branch divergence and reconvergence information to the hardware
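The SPMD execution pattern described in the highlights can be illustrated with a small sketch (our own toy example, not code from the paper): several threads run the same short program on private data, so at every step they all execute the same instruction, differing only in their operands.

```python
# Hypothetical scalar program shared by all threads: each entry is
# (opcode, immediate operand).
program = [("mul", 2), ("add", 3), ("mul", 5)]

def run_thread(x):
    """Execute the shared program on one thread's private data."""
    for op, imm in program:
        if op == "add":
            x += imm
        elif op == "mul":
            x *= imm
    return x

# Four SPMD "threads" with different private inputs: identical control
# flow and instruction sequence, different data.
thread_data = [1, 2, 3, 4]
results = [run_thread(x) for x in thread_data]
print(results)
```

Because every thread fetches the same instruction at each step, a single fetch could in principle serve all four of them, which is the redundancy DITVA exploits.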


Summary

Introduction

Single-Program Multiple-Data (SPMD) applications express parallelism by creating multiple instruction streams executed by scalar threads running the same program. The implicit data-level parallelism (DLP) that exists across the threads of an SPMD program is neither captured by the programming model – threads execute asynchronously – nor leveraged by current processors. DITVA extends an in-order SMT architecture by dynamically aggregating instruction instances from different threads and steering them to SIMD units. It even supports explicit SIMD instruction sets such as SSE and AVX efficiently on the same physical execution units, allowing programmers and compilers to freely combine explicitly-vectorized SIMD code and implicitly-vectorized SPMD code.
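The aggregation idea can be sketched in a toy functional model (our own illustration, with a hypothetical three-instruction ISA; it is not the paper's microarchitecture): within a warp, threads sitting at the same program counter share one instruction fetch, and their scalar instances are merged into a single dynamic vector (DV) instruction, while a min-PC heuristic tends to bring diverged threads back into lockstep.

```python
# Hypothetical ISA: ("addi", imm) increments, ("blt", bound, target)
# branches if x < bound, ("halt",) stops the thread.
PROGRAM = [
    ("addi", 1),     # 0: x += 1
    ("blt", 4, 0),   # 1: if x < 4 goto 0  (trip count varies per thread)
    ("halt",),       # 2:
]

def run_warp(xs):
    """Run one warp of threads; return final values and fetch count."""
    pc = [0] * len(xs)          # per-thread program counters
    fetches = 0
    while any(PROGRAM[p][0] != "halt" for p in pc):
        # Min-PC heuristic: schedule the earliest PC among non-halted
        # threads, which favors reconvergence into lockstep.
        active_pc = min(p for p in pc if PROGRAM[p][0] != "halt")
        mask = [i for i, p in enumerate(pc) if p == active_pc]
        fetches += 1            # one fetch serves every thread in `mask`
        op = PROGRAM[active_pc]
        for i in mask:          # one DV-instruction: same op, many lanes
            if op[0] == "addi":
                xs[i] += op[1]
                pc[i] += 1
            elif op[0] == "blt":
                pc[i] = op[2] if xs[i] < op[1] else active_pc + 1
    return xs, fetches
```

For inputs `[0, 2]`, the two threads iterate the loop a different number of times, so they diverge near the end; running them scalar would cost 12 fetches in this model, while the grouped execution needs only 8, illustrating the fetch savings the paper reports.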

Motivation
Instruction redundancy across SPMD threads
Related work
The SIMT execution model
Instruction redundancy in SMT
Thread reconvergence for SPMD applications
The Dynamic Inter-Thread Vectorization Architecture
Data memory accesses
Maintaining lockstep execution
DV-instructions per cycle
Experimental Framework
Throughput
Divergence and mispredictions
Impact of split data TLB
L1 cache bank conflict reduction
Quantitative evaluation
Conclusion
