Abstract

Matched Filtering can be applied to various fields owing to its ability to compute the correlation coefficient of two vectors and thereby detect many template events. With improvements in observation techniques, massive amounts of observation data and templates have been accumulated, so reducing the computation cost of Matched Filtering has become an important issue. The core of this computation is a matrix-matrix product, which Tensor Core on the NVIDIA Volta GPU is expected to compute rapidly. However, the actual performance of Tensor Core is usually limited by the bandwidth of shared memory or global memory. In addition, only lower-precision data types are supported by the current API for Tensor Core, so we have to prevent a decline in accuracy in the computation. In this letter, we designed a Matched Filtering algorithm that solves these problems and utilizes the high arithmetic capacity of Tensor Core. Specifically, we reduced the number of accesses to global memory and shared memory by using a low-level description. In addition, we introduced local normalization to reduce the numerical error. We applied the developed kernel to template matching of seismic observation data and compared its performance and accuracy with cuBLAS, a common library in GPU computation. Compared with the cuBLAS function that offered almost the same accuracy as our kernel, we reduced the elapsed time by a factor of 4.74.
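As a concrete reference point for the cuBLAS comparison mentioned above, the sketch below shows the kind of baseline call such a comparison typically uses: an FP16-input, FP32-accumulate GEMM through cublasGemmEx. The cuBLAS API itself is real (CUDA 11+ signature), but the function name gemm_fp16_baseline, the matrix roles (templates times sliding windows), and the sizes are illustrative assumptions, not the letter's actual benchmark code.

// Hedged sketch: FP16-input, FP32-accumulate GEMM via cuBLAS, assumed here as
// the kind of baseline the letter compares against. dT (templates, m x k) and
// dW (windows, k x n) are device arrays already converted to half precision;
// column-major storage is assumed, as cuBLAS expects.
#include <cublas_v2.h>
#include <cuda_fp16.h>

void gemm_fp16_baseline(cublasHandle_t handle,
                        const __half *dT, const __half *dW, float *dC,
                        int m, int n, int k)
{
    const float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle,
                 CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 dT, CUDA_R_16F, m,      // A: m x k, lda = m
                 dW, CUDA_R_16F, k,      // B: k x n, ldb = k
                 &beta,
                 dC, CUDA_R_32F, m,      // C: m x n in FP32, ldc = m
                 CUBLAS_COMPUTE_32F,     // accumulate in FP32
                 CUBLAS_GEMM_DEFAULT);   // cuBLAS may select a Tensor Core kernel
}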

Highlights

  • Matched Filtering [1] is a process of detecting a specific pattern in a wave contaminated with noise, and it has been applied to various fields, including radar signal detection [2], detection of gravitational waves [3], and detection of earthquake events [4]

  • We focused on Matched Filtering in the time domain

  • The largest proportion of the computation cost of Matched Filtering is in the matrix-matrix product; a minimal sketch of this formulation is given after this list

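To make the highlights concrete, the following minimal reference sketch (our own illustration under stated assumptions, not the authors' kernel) computes the time-domain normalized cross-correlation of one template against every sliding window of the data. Stacking all templates as rows of one matrix and all locally normalized windows as columns of another turns the two inner loops into a single matrix-matrix product, which is the part that dominates the cost and is mapped to Tensor Core.

// Reference sketch: cc[s] = sum_i t[i] * x[s+i] / (||t|| * ||x[s..s+L-1]||).
// Dividing by the norm of each window is one simple form of local
// normalization; the letter's exact scheme may differ.
#include <cmath>
#include <vector>

std::vector<float> matched_filter_time_domain(const std::vector<float> &t,  // template, length L
                                              const std::vector<float> &x)  // observed data, length N >= L
{
    const size_t L = t.size(), N = x.size();
    float tnorm = 0.0f;
    for (float v : t) tnorm += v * v;
    tnorm = std::sqrt(tnorm);

    std::vector<float> cc(N - L + 1);
    for (size_t s = 0; s + L <= N; ++s) {
        float dot = 0.0f, wnorm = 0.0f;
        for (size_t i = 0; i < L; ++i) {
            dot   += t[i] * x[s + i];
            wnorm += x[s + i] * x[s + i];
        }
        cc[s] = dot / (tnorm * std::sqrt(wnorm) + 1e-12f);  // local normalization of the window
    }
    return cc;
}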

Summary

INTRODUCTION

Matched Filtering [1] is a process of detecting a specific pattern in a wave contaminated with noise, and it has been applied to various fields, including radar signal detection [2], detection of gravitational waves [3], and detection of earthquake events [4]. With the improvement of measurement technology, massive observation data have been accumulated, and reducing the computation cost of Matched Filtering has become an important issue. The core of this computation is a matrix-matrix product, which Tensor Core can compute rapidly, but two points must be addressed. The first point is that the actual performance of Tensor Core is usually limited by the bandwidth of shared memory or global memory. The second point is that the current Tensor Core supports only lower-precision data types, i.e., a 16-bit floating-point number, an 8-bit integer, and a 1-bit integer. If we design an algorithm that satisfies the conditions mentioned above, we can benefit from the very high arithmetic performance of Tensor Core operations. This letter proposes an algorithm that accelerates the core computation in Matched Filtering using Tensor Core with 16-bit floating-point numbers. We issue Tensor Core operations with a lower memory access cost. We demonstrate that our algorithm attains a reasonable speedup and an improvement in accuracy when compared to cuBLAS [11], a common library for Tensor Core.
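For orientation only, the sketch below shows how FP16 Tensor Core operations can be issued from a CUDA kernel through the high-level nvcuda::wmma API with FP32 accumulation. This is not the letter's kernel, which relies on a lower-level description to reduce shared- and global-memory accesses; the kernel name, the fixed 16x16x16 tiling, and the omitted per-warp pointer offsets are illustrative assumptions.

// Illustrative wmma sketch: one warp accumulates a 16x16 tile of C = A * B in FP32.
// A (16 x K) and B (K x 16) are row-major half-precision matrices; block/warp
// offsets into A, B, and C are omitted for brevity.
#include <mma.h>
using namespace nvcuda;

__global__ void wmma_tile_gemm(const half *A, const half *B, float *C,
                               int lda, int ldb, int ldc, int K)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);                     // FP32 accumulator starts at zero

    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + k, lda);        // 16x16 tile of A at column k
        wmma::load_matrix_sync(b_frag, B + k * ldb, ldb);  // 16x16 tile of B at row k
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);    // Tensor Core matrix-multiply-accumulate
    }
    wmma::store_matrix_sync(C, c_frag, ldc, wmma::mem_row_major);
}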

METHODOLOGY
APPLICATION EXAMPLE
Evaluation of Performance
Evaluation of Accuracy
Findings
CONCLUSION
