Design and exploitation of a high-performance SIMD floating-point unit for Blue Gene/L

S Chatterjee, M Mendell, C A Lapkowski, M Gupta, F G Gustavson, K A Dockser, R Nair, P Bergner, L R Bachega, G K Liu, T J C Ward, P Wu, C D Wait, J A Gunnels

doi:10.1147/rd.492.0377

Abstract

We describe the design of a dual-issue single-instruction, multiple-data-like (SIMD-like) extension of the IBM PowerPC® 440 floating-point unit (FPU) core and the compiler and algorithmic techniques to exploit it. This extended FPU is targeted at both the IBM massively parallel Blue Gene®/L machine and the more pervasive embedded platforms. We discuss the hardware and software codesign that was essential in order to fully realize the performance benefits of the FPU when constrained by the memory bandwidth limitations and high penalties for misaligned data access imposed by the memory hierarchy on a Blue Gene/L node. Using both hand-optimized and compiled code for key linear algebraic kernels, we validate the architectural design choices, evaluate the success of the compiler, and quantify the effectiveness of the novel algorithm design techniques. Our measurements show that the combination of algorithm, compiler, and hardware delivers a significant fraction of peak floating-point performance for compute-bound kernels, such as matrix multiplication, and delivers a significant fraction of peak memory bandwidth for memory-bound kernels, such as DAXPY, while remaining largely insensitive to data alignment.
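As a rough illustration of the memory-bound kernel named above, DAXPY computes y = a*x + y over double-precision vectors. Each element needs two loads and one store for a single multiply-add, which is why memory bandwidth rather than arithmetic limits its speed. The sketch below is plain portable C, not the paper's code and not Blue Gene/L intrinsics; the function names `daxpy_scalar` and `daxpy_paired` are hypothetical. The paired variant only mimics, at the source level, the idea of handling two adjacent elements per step, with a scalar step for an odd trailing element.

```c
#include <stddef.h>

/* Reference DAXPY: y[i] = a * x[i] + y[i] for i in [0, n). */
void daxpy_scalar(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/*
 * Hypothetical paired variant (an assumption for illustration only):
 * processes two adjacent elements per iteration, the shape a compiler
 * targeting a two-way SIMD-like FPU might look for, then finishes any
 * odd trailing element with a scalar step.
 */
void daxpy_paired(size_t n, double a, const double *x, double *y)
{
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        /* Load one pair from each vector. */
        double x0 = x[i], x1 = x[i + 1];
        double y0 = y[i], y1 = y[i + 1];

        /* One multiply-add per element of the pair. */
        y[i]     = a * x0 + y0;
        y[i + 1] = a * x1 + y1;
    }

    /* Remainder when n is odd. */
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```

Both functions produce the same result for any `n`; whether the paired form actually maps onto paired loads and stores depends on the compiler and on data alignment, which this sketch does not address.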

Journal: IBM Journal of Research and Development
Publication Date: Mar 1, 2005
Citations: 54

