GPU Cache Research Articles

Long memory latency and limited throughput are performance bottlenecks for GPGPU applications. Memory latency runs to hundreds of cycles, which is difficult to hide by simply interleaving the execution of tens of warps. While the cache hierarchy helps reduce pressure on the memory system, massive Thread-Level Parallelism (TLP) often causes excessive cache contention. This paper proposes Adaptive PREfetching and Scheduling (APRES) to improve GPU cache efficiency. APRES relies on the following observations. First, certain static load instructions tend to generate memory addresses with very high locality. Second, even when loads have no locality, their access addresses can still show a highly strided pattern. Third, the locality behavior tends to be consistent regardless of warp ID. APRES schedules warps so that as many cache hits as possible are generated before any cache misses are generated. This minimizes cache thrashing when many warps contend for a cache line. Realizing this, however, requires predicting which warps will hit the cache in the near future. Rather than directly predicting future cache hits and misses for each warp, APRES creates a group of warps that will execute the same load instruction in the near future. Based on the third observation, we expect the locality behavior to be consistent across all warps in the group. If the first warp executed in the group hits the cache, the load is considered a high-locality type, and APRES prioritizes all warps in the group. Group prioritization leads to consecutive cache hits because the grouped warps are likely to access the same cache line. If the first warp misses the cache, the load is considered a strided type, and APRES generates prefetch requests for the other warps in the group. APRES then prioritizes the warps targeted by prefetches so that their demand requests are merged in the Miss Status Holding Register (MSHR) or hit the prefetched lines.
On memory-intensive applications, APRES achieves a 31.7% performance improvement over the baseline GPU and a 7.2% additional speedup over the best combination of existing warp scheduling and prefetching methods.
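The grouping and classification steps described above can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the hardware design from the paper: `ToyCache`, `group_by_next_load`, `schedule_group`, the 128-byte line size, the LRU policy, and the caller-supplied `stride` are all hypothetical, and a prefetch is modeled as an immediate cache fill rather than an in-flight MSHR entry.

```python
from collections import defaultdict

LINE_SIZE = 128  # bytes per cache line (assumed for this sketch)


class ToyCache:
    """Small fully associative cache of line addresses with LRU replacement."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = []  # least recently used first, most recently used last

    def access(self, addr):
        """Touch the line holding addr; return True on a hit, False on a miss."""
        line = addr // LINE_SIZE
        hit = line in self.lines
        if hit:
            self.lines.remove(line)
        elif len(self.lines) >= self.num_lines:
            self.lines.pop(0)  # evict the least recently used line
        self.lines.append(line)
        return hit


def group_by_next_load(warps):
    """Group warps by the PC of the load they will execute next.

    warps maps warp_id -> (load_pc, demand_address).
    """
    groups = defaultdict(list)
    for warp_id, (pc, addr) in sorted(warps.items()):
        groups[pc].append((warp_id, addr))
    return groups


def schedule_group(cache, group, stride):
    """Classify a group by its first warp, then run the remaining warps.

    Returns (kind, prefetched_addresses, demand_hits_of_remaining_warps).
    """
    (first_id, first_addr), rest = group[0], group[1:]
    prefetched = []
    if cache.access(first_addr):
        kind = "high_locality"  # first warp hit: prioritize the whole group
    else:
        kind = "strided"  # first warp missed: prefetch for the other warps
        for warp_id, _ in rest:
            # Assumption: addresses advance by a fixed stride per warp ID.
            target = first_addr + (warp_id - first_id) * stride
            cache.access(target)  # model the prefetch as an immediate fill
            prefetched.append(target)
    hits = sum(cache.access(addr) for _, addr in rest)
    return kind, prefetched, hits
```

In this model, a group whose first warp hits is run back to back so the remaining warps reuse the same line, while a group whose first warp misses triggers one prefetch per remaining warp before those warps issue their demand loads.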

Purpose: To develop a novel superposition/convolution based algorithm which eliminates the poly‐energetic transport approximation.
Methods: We leveraged the hardware functionality of the GPU texture unit to allow our modern dual‐source superposition/convolution based dose calculation engine to efficiently perform multiple transports simultaneously. We experimented with dividing the spectrum in half (dual‐energetic), quarters (quad‐energetic) or N energy bins (multi‐energetic). These divisions were applied after the TERMA was computed using the exact, full‐spectrum attenuation. We benchmarked the dosimetric properties of poly‐, dual‐, quad‐ and multi‐energetic superposition against a series of Monte Carlo dose accuracy benchmarks based on the ICCR 2000 benchmark and performed a manual commissioning for an Elekta Infinity operating at 6 MV.
Results: The performance cost of dual‐energetic superposition was 11%–50%. The performance cost of quad‐energetic superposition was 39%–151%. Performance varied depending on GPU architecture and cache effects. The slower performance of quad‐energetic superposition was due to a smaller CUDA block size and the use of a separate density texture: we normally pack TERMA and density into a single texture. TERMA performance costs were 1% and 10%, respectively. The traditional, poly‐energetic superposition overestimated dose, particularly within the first 10 cm and in bone/aluminum. Dual‐energetic superposition greatly reduced this overestimation. Quad‐energetic and multi‐energetic superposition produced nearly identical results. Good agreement was achieved in air, water, bone and aluminum; all methods had trouble matching the falloff in lung due to the small treatment field. We based the manual commissioning of our Elekta Infinity linear accelerator on a published spectrum. We modeled the extra‐focal source as being very soft, which necessitated a slight hardening of the primary source.
Conclusions: We have completed a multi‐energetic, GPU‐accelerated superposition/convolution based algorithm, which improves accuracy over the traditional, poly‐energetic approach and allows the use of physically accurate spectra.
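The idea of dividing a spectrum into halves, quarters or N energy bins can be sketched as follows. This is an illustration only: the abstract does not say how the bin boundaries were chosen, so the function `split_spectrum`, its equal-sample-count split, and the use of a fluence-weighted mean energy per bin are all assumptions, and any spectrum passed in is the caller's own data rather than values from the study.

```python
def split_spectrum(energies_mev, fluence, n_bins):
    """Split a discrete spectrum into n_bins contiguous energy bins.

    Assumptions of this sketch: bins hold roughly equal numbers of
    spectrum samples, and every bin has positive total fluence.
    Returns a list of (relative_weight, mean_energy_mev), one per bin.
    """
    n = len(energies_mev)
    if len(fluence) != n:
        raise ValueError("energies_mev and fluence must have equal length")
    if not 1 <= n_bins <= n:
        raise ValueError("n_bins must be between 1 and the number of samples")
    total = sum(fluence)
    bins = []
    for b in range(n_bins):
        lo = b * n // n_bins        # first sample index of this bin
        hi = (b + 1) * n // n_bins  # one past the last sample index
        weight = sum(fluence[lo:hi])
        # Fluence-weighted mean energy represents the bin in its transport.
        mean_e = sum(
            e * f for e, f in zip(energies_mev[lo:hi], fluence[lo:hi])
        ) / weight
        bins.append((weight / total, mean_e))
    return bins
```

With `n_bins=1` the function reduces to the single-bin, poly‐energetic case; `n_bins=2` and `n_bins=4` correspond to the dual‐ and quad‐energetic divisions, and the relative weights always sum to one.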

Articles published on GPU Cache

Parallel Overlapping Community Detection Algorithm on GPU

Technique and Instrument for Effective CPU and GPU Access Request Arbitration Using On-Chip Cache

MeshTaichi

Correction to: Aggressive GPU cache bypassing with monolithic 3D-based NoC

Aggressive GPU cache bypassing with monolithic 3D-based NoC

RDMKE: Applying Reuse Distance Analysis to Multiple GPU Kernel Executions

RDGC: A Reuse Distance-Based Approach to GPU Cache Performance Analysis

Shared Last-Level Cache Management and Memory Scheduling for GPGPUs with Hybrid Main Memory

GFlink: An In-Memory Computing Architecture on Heterogeneous CPU-GPU Clusters for Big Data

Hybrid Lighting for faster rendering of scenes with many lights

Filtering Translation Bandwidth with Virtual Caching

Z2 traversal order: An interleaving approach for VR stereo rendering on tile-based GPUs

APRES

Data-centric combinatorial optimization of parallel code

K(+)-buffer: An Efficient, Memory-Friendly and Dynamic k-buffer Framework.

SemCache++: semantics-aware caching for efficient multi-GPU offloading

Memory bandwidth optimization of SpMV on GPGPUs

A Simple Multi-Models Rendering Framework - PM-OCMMRF

SU-E-T-719: Multi-Energetic, GPU-Accelerated Superposition/Convolution

Rendering method for large-area terrain based on texture array
