Abstract: As a powerful and flexible processor, the Graphics Processing Unit (GPU) offers great capability for many high-performance computing applications. Sweep3D, which deterministically simulates single-group, time-independent discrete ordinates (Sn) neutron transport on a 3D Cartesian geometry, represents the key part of a real ASCI application. The wavefront process used for parallel computation in Sweep3D limits the number of concurrent threads on the GPU. In this paper, we present multi-dimensional optimization methods for Sweep3D that can be implemented efficiently on the fine-grained parallel architecture of the GPU. Our results show that the overall performance of Sweep3D on the CPU-GPU hybrid platform can be improved by up to 4.38 times compared to the CPU-based implementation.
Keywords: Sweep3D, Neutron Transport, GPU, CUDA
1. INTRODUCTION
When the first GPU was introduced in 1999, it was mainly used to transform, light, and rasterize triangles in three-dimensional (3D) graphics applications [1]. GPU performance doubles about every six to nine months, a rate that has let it far outpace the Central Processing Unit (CPU) [2]. Modern GPUs are throughput-oriented parallel processors that can offer peak performance of up to 2.72 Tflops in single-precision floating point and 544 Gflops in double-precision floating point [3]. At the same time, GPU programming models such as NVIDIA's Compute Unified Device Architecture (CUDA) [4], AMD/ATI's Streaming Computing [5], and OpenCL [6] have matured, and they simplify the process of developing non-graphics applications. The improvement in computing performance, together with the development of programming models and software, makes the GPU more and more suitable for general-purpose computing. At present, the GPU has been successfully applied to medical imaging, universe exploration, physics simulation, linear system solutions, and other computation-intensive domains [7].
There is a growing need to accurately simulate physical systems whose evolution depends on the transport of subatomic particles coupled with other complex physics [8]. In many simulations, particle transport calculations consume the majority of the computational resources. For example, the time devoted to particle transport problems in multi-physics simulations takes up