Abstract

We implemented and evaluated triple precision Basic Linear Algebra Subprograms (BLAS) subroutines, AXPY, GEMV, and GEMM, on a Tesla C2050. In this paper, we present a Double Single (D+S) type triple precision floating-point value format and its operations, which are based on techniques similar to Double-Double (DD) type quadruple precision operations. On the GPU, the D+S-type operations are more costly than the DD-type operations, both in theory and in practice. Therefore, the triple precision GEMM, which is a compute-bound operation, is slower than the quadruple precision GEMM. However, the triple precision AXPY and GEMV are memory-bound operations on the GPU, so the execution time of these triple precision subroutines is close to 3/4 of that of the quadruple precision subroutines. We conclude that the triple precision value format is useful for memory-bound operations in cases where quadruple precision is not required but double precision is not sufficient.
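To make the D+S idea concrete, the following is a minimal sketch in C, assuming (as the abstract's analogy to DD arithmetic suggests) that a D+S value is stored as a `double` high part plus a `float` correction term, and that addition uses an error-free transformation (Knuth's two-sum) on the high parts. The struct name, function names, and the renormalization step are illustrative assumptions, not the paper's actual implementation.

```c
#include <math.h>

/* Hypothetical D+S triple precision value: a double high part plus
   a float correction term, analogous to the DD (double-double) pair. */
typedef struct { double hi; float lo; } ds_t;

/* Knuth's two-sum: s + e == a + b exactly, with s = fl(a + b). */
static void two_sum(double a, double b, double *s, double *e) {
    *s = a + b;
    double v = *s - a;
    *e = (a - (*s - v)) + (b - v);
}

/* Add two D+S values: two-sum on the high parts, fold the rounding
   error and both low parts together, then renormalize so that hi
   absorbs as much of the result as possible and lo keeps the rest. */
static ds_t ds_add(ds_t x, ds_t y) {
    double s, e;
    two_sum(x.hi, y.hi, &s, &e);
    e += (double)x.lo + (double)y.lo;
    double hi = s + e;
    ds_t r;
    r.hi = hi;
    r.lo = (float)(e - (hi - s));  /* residual kept in single precision */
    return r;
}
```

Adding 1.0 and 1e-20 with `ds_add`, for example, keeps the tiny addend in the `lo` correction term instead of losing it to double rounding, which is the extra precision the D+S format provides.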
