Performance study on CUDA GPUs for parallelizing the local ensemble transformed Kalman filter algorithm

Timothy Blattner, Shiming Yang

doi:10.1002/cpe.1859

Abstract

Modern graphics cards provide computational capabilities that exceed those of current CPUs. As a computationally intensive problem, numerical weather prediction stands to benefit from the massive number of threads and the high memory throughput of the graphics architecture. In this paper, we present the key steps for applying the Compute Unified Device Architecture (CUDA) programming framework to one key component of numerical weather prediction, the data assimilation algorithm, which incorporates observational data into the model to produce the best initial condition for the next prediction. The data assimilation algorithm studied in this paper exhibits good localization and lends itself to parallelism. To maximize the throughput of the graphics card, we use over a million CUDA threads, coalesced global memory accesses, and fast on-chip shared memory. We also demonstrate the differences between GPU architecture generations, from the GTX 200 series to Fermi. The experiments are carried out separately on a GTX 260 (GTX 200 series) and a GTX 460 (Fermi) graphics card. Results show a speedup of 72.1× on the GTX 260 and 92.7× on the GTX 460. The results provide compelling evidence for applying CUDA GPUs to highly demanding scientific computing. Copyright © 2011 John Wiley & Sons, Ltd.
