CPU Baseline Research Articles

The primary function of multimedia systems is to seamlessly transform and display content to users while maintaining the perception of acceptable quality. For images and videos, perceptual quality assessment algorithms play an important role in determining what is acceptable quality and what is unacceptable from a human visual perspective. As modern image quality assessment (IQA) algorithms gain widespread adoption, it is important to achieve a balance between their computational efficiency and their quality prediction accuracy. One way to improve computational performance to meet real-time constraints is to use simplistic models of visual perception, but such an approach has a serious drawback in terms of poor-quality predictions and limited robustness to changing distortions and viewing conditions. In this paper, we investigate the advantages and potential bottlenecks of implementing a best-in-class IQA algorithm, Most Apparent Distortion, on graphics processing units (GPUs). Our results suggest that an understanding of the GPU and CPU architectures, combined with detailed knowledge of the IQA algorithm, can lead to non-trivial speedups without compromising prediction accuracy. A single-GPU and a multi-GPU implementation showed a 24× and a 33× speedup, respectively, over the baseline CPU implementation. A bottleneck analysis revealed the kernels with the highest runtimes, and a microarchitectural analysis illustrated the underlying reasons for the high runtimes of these kernels. Programs written with optimizations such as blocking that map well to CPU memory hierarchies do not map well to the GPU’s memory hierarchy. While compute unified device architecture (CUDA) is convenient to use and is powerful in facilitating general purpose GPU (GPGPU) programming, knowledge of how a program interacts with the underlying hardware is essential for understanding performance bottlenecks and resolving them.
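As a generic illustration of the loop blocking (tiling) mentioned above, the following minimal Python sketch multiplies two matrices one tile at a time so that each tile can stay in a CPU cache while it is reused. It is not taken from the paper's implementation; the function name `blocked_matmul` and the default tile size of 32 are assumptions chosen for illustration only.

```python
import numpy as np

def blocked_matmul(a, b, block=32):
    """Multiply a (n x k) by b (k x m) one tile at a time.

    Hypothetical helper for illustration: `block` is the tile edge length,
    which on a real CPU would be tuned to the cache size.
    """
    n, k = a.shape
    k2, m = b.shape
    if k != k2:
        raise ValueError("inner dimensions must match")
    c = np.zeros((n, m), dtype=np.result_type(a, b))
    for i0 in range(0, n, block):          # tile rows of the output
        for j0 in range(0, m, block):      # tile columns of the output
            for k0 in range(0, k, block):  # accumulate over the inner dimension
                # NumPy slices clip at the array edge, so partial tiles work.
                c[i0:i0 + block, j0:j0 + block] += (
                    a[i0:i0 + block, k0:k0 + block]
                    @ b[k0:k0 + block, j0:j0 + block]
                )
    return c

rng = np.random.default_rng(0)
a = rng.standard_normal((50, 70))
b = rng.standard_normal((70, 45))
c = blocked_matmul(a, b, block=16)
```

The tile size that suits a CPU cache is not necessarily a good fit for GPU shared memory or registers, which is one way to read the abstract's remark that CPU-oriented blocking does not carry over directly.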

This study compares optimization characteristics, plan quality, and treatment delivery efficiency between total marrow irradiation (TMI) plans created with the new TomoTherapy graphics processing unit (GPU) based dose engine and with the CPU/cluster based dose engine. Five TMI plans created on an anthropomorphic phantom were optimized and calculated with both dose engines. The planning target volume (PTV) included all the bones from head to mid femur except for the upper extremities. Evaluated organs at risk (OAR) consisted of lung, liver, heart, kidneys, and brain. The following treatment parameters were used to generate the TMI plans: field widths of 2.5 and 5 cm, modulation factors of 2 and 2.5, and pitch of either 0.287 or 0.43. The optimization parameters were chosen based on the PTV and OAR priorities, and the plans were optimized with a fixed number of iterations. The PTV constraint was selected to ensure that at least 95% of the PTV received the prescription dose. The plans were evaluated based on D80 and D50 (dose to 80% and 50% of the OAR volume, respectively) and hotspot volumes within the PTVs. Gamma indices (Γ) were also used to compare planar dose distributions between the two dose engines. The optimization and dose calculation times were compared between the two systems. The treatment delivery times were also evaluated. The results showed very good dosimetric agreement between the GPU and CPU calculated plans for all of the evaluated planning parameters, indicating that both systems converge on nearly identical plans. All D80 and D50 parameters varied by less than 3% of the prescription dose, with an average difference of 0.8%. In a gamma analysis with Γ(3%, 3 mm) criteria, over 90% of the calculated voxels in the GPU plan satisfied the Γ < 1 criterion relative to the baseline CPU plan. The average percentage of voxels meeting the Γ < 1 criterion across all the plans was 97%.
In terms of dose optimization/calculation efficiency, there was a roughly 20-fold reduction in planning time with the new GPU system. The average optimization/dose calculation time was 579 min with the traditional CPU/cluster based system vs 26.8 min with the GPU based system. There was no difference in the calculated treatment delivery time per fraction. Beam-on time varied with field width and pitch and ranged between 15 and 28 min. The TomoTherapy GPU based dose engine is capable of calculating TMI treatment plans with plan quality nearly identical to plans calculated using the traditional CPU/cluster based system, while significantly reducing the time required for optimization and dose calculation.
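The Γ(3%, 3 mm) comparison described above can be sketched in one dimension as follows. This is a minimal illustration under stated assumptions, not the study's analysis code: the function names `gamma_index_1d` and `pass_rate` are hypothetical, the dose tolerance is taken as 3% of the maximum reference dose (global normalization, an assumption the abstract does not specify), and the search is a brute-force minimum over all evaluated points.

```python
import numpy as np

def gamma_index_1d(ref_dose, eval_dose, positions_mm,
                   dose_tol_frac=0.03, dist_tol_mm=3.0):
    """Gamma value at each reference point of a 1D dose profile.

    Assumption: dose differences are normalized by 3% of the maximum
    reference dose (global normalization).
    """
    ref_dose = np.asarray(ref_dose, dtype=float)
    eval_dose = np.asarray(eval_dose, dtype=float)
    positions_mm = np.asarray(positions_mm, dtype=float)
    dose_tol = dose_tol_frac * ref_dose.max()
    # Rows index reference points, columns index evaluated points.
    dose_term = (eval_dose[None, :] - ref_dose[:, None]) / dose_tol
    dist_term = (positions_mm[None, :] - positions_mm[:, None]) / dist_tol_mm
    # Gamma is the smallest combined dose/distance deviation per reference point.
    return np.sqrt(dose_term ** 2 + dist_term ** 2).min(axis=1)

def pass_rate(gamma):
    """Fraction of points meeting the gamma < 1 criterion."""
    return float(np.mean(np.asarray(gamma) < 1.0))

positions = np.arange(5.0)                       # sample points 1 mm apart
ref = np.array([0.0, 0.0, 100.0, 0.0, 0.0])      # made-up reference profile
shifted = np.array([0.0, 0.0, 0.0, 100.0, 0.0])  # same peak moved by 1 mm
gamma_shift = gamma_index_1d(ref, shifted, positions)
```

In this made-up example a 1 mm shift of the peak stays within the 3 mm distance tolerance, so every point passes, whereas a uniform 10% dose offset would fail at every point. Clinical gamma analysis is performed on 2D or 3D dose grids, usually with interpolation and a low-dose threshold, none of which is modeled here.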

Articles published on CPU Baseline

PIM GPT a hybrid process in memory accelerator for autoregressive transformers

GPU implementation of the Frenet Path Planner for embedded autonomous systems: A case study in the F1tenth scenario

PimPam: Efficient Graph Pattern Matching on Real Processing-in-Memory Hardware

GRIP: A Graph Neural Network Accelerator Architecture

Scheduling Bag-of-Tasks in Clouds Using Spot and Burstable Virtual Machines

A Tensor Processing Framework for CPU-Manycore Heterogeneous Systems

NASCENT2: Generic Near-Storage Sort Accelerator for Data Analytics on SmartSSD

A scalable and reconfigurable in-memory architecture for ternary deep spiking neural network with ReRAM based neurons

Addressing Interpretability and Cold-Start in Matrix Factorization for Recommender Systems

Accelerating k-Medians Clustering Using a Novel 4T-4R RRAM Cell

GPU Acceleration of the Most Apparent Distortion Image Quality Assessment Algorithm

IMEC: A Fully Morphable In-Memory Computing Fabric Enabled by Resistive Crossbar

An OpenCL software compilation framework targeting an SoC-FPGA VLIW chip multiprocessor

Sparse Matrix Multiplication On An Associative Processor

Dosimetric comparison of helical tomotherapy treatment plans for total marrow irradiation created using GPU and CPU dose calculation engines

Techniques for Solving Stiff Chemical Kinetics on Graphical Processing Units

Towards accelerating irregular EDA applications with GPUs

Massively Parallel Logic Simulation with GPUs

Exploring utilisation of GPU for database applications

On the computation of the Circle Hough Transform by a GPU rasterizer
