Iterative reconstruction techniques hold great potential to mitigate the effects of data noise and/or incompleteness, and hence can facilitate patient dose reduction. However, they are not suitable for routine clinical practice because of their long reconstruction times. In this work, the authors accelerated the computations by taking full advantage of the highly parallel computational power of single and multiple graphics processing units (GPUs). In particular, the forward projection algorithm, which is not included in the closed-form formulas, is accelerated and optimized on the GPU. The main contribution is a novel forward projection algorithm that uses multiple threads to handle the computations associated with a group of adjacent rays simultaneously. The proposed algorithm is free of divergence and bank conflicts on the GPU, and benefits from data locality and data reuse. It achieves this efficiency in particular by (i) employing a tiled algorithm with three-level parallelization, (ii) optimizing the thread block size, (iii) maximizing data reuse in constant memory and shared memory, and (iv) exploiting the built-in interpolation capability of texture memory. In addition, to accelerate the iterative algorithms and the Feldkamp-Davis-Kress (FDK) algorithm on the GPU, the authors apply batched fast Fourier transforms (FFTs) to expedite the filtering process in FDK and utilize projection bundling parallelism during backprojection to shorten the execution times of FDK and the expectation-maximization (EM) algorithm. Numerical experiments conducted on an NVIDIA Tesla C1060 GPU demonstrated the superiority of the proposed algorithms in terms of computation time. The forward projection, filtering, and backprojection times for generating a 512 × 512 × 512 volume image from 360 projections of 512 × 512 using one GPU are about 4.13, 0.65, and 2.47 s (including distance weighting), respectively.
In particular, the proposed forward projection algorithm is ray-driven, and its parallelization strategy evolves from single-thread-for-single-ray (38.56 s), through multithreads-for-single-ray (26.05 s), to multithreads-for-multirays (4.13 s). For the voxel-driven backprojection, the use of texture memory reduces the reconstruction time from 4.95 to 3.35 s. Applying the projection bundle technique reduces the computation time further, to 2.47 s. When multiple GPUs were employed, near-perfect speedups were observed as the number of GPUs increased. For example, with four GPUs, the times for the forward projection, filtering, and backprojection are further reduced to 1.11, 0.18, and 0.66 s, respectively. The results obtained by the GPU-based algorithms are virtually indistinguishable from those obtained on the CPU. The authors have proposed a highly optimized GPU-based forward projection algorithm, as well as GPU-based FDK and expectation-maximization reconstruction algorithms. The authors' compute unified device architecture (CUDA) codes provide exceedingly fast forward projection and backprojection that outperform implementations using shading languages, the Cell Broadband Engine Architecture, and previous CUDA implementations. The reconstruction times of the FDK and EM algorithms were considerably shortened, which can facilitate their routine usage in a variety of applications such as image quality improvement and dose reduction.
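To make the reported timings easier to compare, the short Python sketch below derives speedup and parallel-efficiency figures from the numbers quoted in the abstract. It is an illustrative calculation only, not part of the authors' code: the helper names `speedup` and `multi_gpu_report` are hypothetical, and parallel efficiency is assumed here to mean speedup divided by the number of GPUs.

```python
# Timings in seconds, copied from the abstract
# (512 x 512 x 512 volume, 360 projections of 512 x 512).
ONE_GPU = {"forward projection": 4.13, "filtering": 0.65, "backprojection": 2.47}
FOUR_GPUS = {"forward projection": 1.11, "filtering": 0.18, "backprojection": 0.66}

# Single-GPU forward projection time for each parallelization strategy.
FORWARD_STRATEGIES = {
    "single-thread-for-single-ray": 38.56,
    "multithreads-for-single-ray": 26.05,
    "multithreads-for-multirays": 4.13,
}


def speedup(baseline_s, improved_s):
    """Return how many times faster the improved run is than the baseline."""
    return baseline_s / improved_s


def multi_gpu_report(n_gpus=4):
    """Map each stage to (speedup, efficiency) for the four-GPU timings.

    Efficiency is an assumed definition: speedup / n_gpus, so 1.0 would be
    a perfectly linear speedup.
    """
    report = {}
    for stage, one_gpu_time in ONE_GPU.items():
        s = speedup(one_gpu_time, FOUR_GPUS[stage])
        report[stage] = (s, s / n_gpus)
    return report


if __name__ == "__main__":
    for stage, (s, eff) in multi_gpu_report().items():
        print(f"{stage}: speedup {s:.2f}x, efficiency {eff:.0%}")
    baseline = FORWARD_STRATEGIES["single-thread-for-single-ray"]
    for name, t in FORWARD_STRATEGIES.items():
        print(f"{name}: {speedup(baseline, t):.2f}x vs. single-thread-for-single-ray")
```

Under that assumed definition, each of the three stages runs at roughly 90% efficiency or better on four GPUs, which is consistent with the "near-perfect speedups" described above.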