Abstract

Back-Projection is the main algorithm in Computed Tomography for reconstructing images from a set of recorded projections. It is used in both fast analytical methods and high-quality iterative techniques. X-ray imaging facilities rely on Back-Projection to reconstruct internal structures in material samples and living organisms with high spatial and temporal resolution. Fast image reconstruction is also essential to track and control the processes under study in real time. In this article, we present efficient implementations of the Back-Projection algorithm for parallel hardware. We survey a range of parallel architectures presented by the major hardware vendors during the last 10 years. We analyze the similarities and differences between these architectures and highlight how specific features can be used to enhance the reconstruction performance. In particular, we build a performance model to find hardware hotspots and propose several optimizations to balance the load between the texture engine, computational and special function units, and different types of memory, maximizing the utilization of all GPU subsystems in parallel. We further show that targeting architecture-specific features boosts the performance 2–7 times compared to the current state-of-the-art algorithms used in standard reconstruction codes. The suggested load-balancing approach is not limited to back-projection but can be used as a general optimization strategy for implementing parallel algorithms.

Highlights

  • X-ray tomography is a powerful tool to investigate materials and small animals at the micro- and nano-scale [1]

  • Our results show that all NVIDIA GPUs starting with Fermi benefit from 64-bit texture fetches if requests are properly localized

  • Type conversions execute at half the peak floating-point rate on AMD GCN GPUs, whereas NVIDIA Kepler GPUs can execute only a single type-conversion instruction per 12 floating-point operations


Summary

Introduction

X-ray tomography is a powerful tool to investigate materials and small animals at the micro- and nano-scale [1]. A recent study suggests implementing back-projection as a convolution in log-polar coordinates to gain high reconstruction speed with interpolation in the image domain [23]. This new method has not yet been adopted in production environments. Multiple papers perform a general analysis of a range of GPU architectures, reveal undisclosed details through micro-benchmarking, and propose guidelines for performance optimization [27,28,29]. This information is invaluable for understanding the factors that limit performance on a specific architecture and for finding an alternative approach that achieves better performance. In [31], we presented two highly optimized back-projection algorithms for NVIDIA Pascal GPUs and a hybrid approach that balances the load between different GPU subsystems by using both in parallel.
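For readers unfamiliar with the operation itself, the following is a minimal CPU sketch of pixel-driven back-projection for parallel-beam geometry, written in plain Python. It is not the optimized GPU implementation discussed in the article: the function name `back_project`, the centered geometry, and the absence of filtering and normalization are all assumptions made for illustration. The linear interpolation in the inner loop is the step that texture-based implementations would typically delegate to the GPU texture unit.

```python
import math


def back_project(sinogram, angles, size):
    """Accumulate projections into a square image (hypothetical helper).

    sinogram: list of projections, each a list of at least 2 detector bins
    angles:   projection angles in radians, one per projection
    size:     width and height of the output image in pixels
    """
    n_bins = len(sinogram[0])
    center_img = (size - 1) / 2.0   # assumed rotation axis: image center
    center_det = (n_bins - 1) / 2.0  # assumed to project onto detector center
    image = [[0.0] * size for _ in range(size)]

    for proj, theta in zip(sinogram, angles):
        c, s = math.cos(theta), math.sin(theta)
        for y in range(size):
            for x in range(size):
                # Detector position hit by the ray through pixel (x, y)
                t = (x - center_img) * c + (y - center_img) * s + center_det
                if 0.0 <= t <= n_bins - 1:
                    # Linear interpolation between the two nearest bins
                    i = min(int(t), n_bins - 2)
                    w = t - i
                    image[y][x] += (1.0 - w) * proj[i] + w * proj[i + 1]
    return image


if __name__ == "__main__":
    # Three constant projections: every covered pixel accumulates 3 * 1.0
    angles = [0.0, math.pi / 3, 2 * math.pi / 3]
    sino = [[1.0] * 9 for _ in angles]
    img = back_project(sino, angles, 5)
    print(round(img[2][2], 6))
```

The triple loop makes the cost structure visible: every pixel performs one interpolated read per projection, which is why the balance between interpolation hardware, arithmetic units, and memory bandwidth dominates performance on parallel hardware.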

Hardware platform
Benchmarking strategy
Quality evaluation
Pseudo‐code conventions
Parallel architectures
Hardware architecture
Execution model
Memory hierarchy
Texture engine
Task partitioning
Code generation
Scheduling
Synchronization
Communication
Summary
Tomographic reconstruction
Back‐projection based on texture engine
Standard version
Multi‐slice reconstruction
Using half‐precision data representation
Efficiency of the standard algorithm
Optimizing locality of texture fetches
Optimizing memory bandwidth
Optimizing occupancy
Summary
Alternative algorithm based on ALUs
The concept
Base implementation
Optimizing the thread mapping to avoid shared memory bank conflicts
AMD and Fermi 32
Modeling
Rounding using floating‐point arithmetic
Method
Half‐float cache
Additional caches
Managing occupancy
CPU and Xeon Phi
Hybrid approaches
Combined approach for Pascal architecture
Oversampling
Findings
Conclusion