CUDA SDK Research Articles

Graphics processing units (GPUs), with their massive computational power of up to thousands of concurrent threads and general-purpose GPU (GPGPU) programming models such as CUDA and OpenCL, have opened up new opportunities for speeding up general-purpose parallel applications. Unfortunately, pre-silicon architectural simulation of modern-day GPGPU architectures and workloads is extremely time-consuming. This paper addresses the GPGPU simulation challenge by proposing a framework, called GPGPU-MiniBench, for generating miniature yet representative GPGPU workloads. GPGPU-MiniBench first summarizes the inherent execution behavior of existing GPGPU workloads in a profile. The central component of the profile is the Divergence Flow Statistics Graph (DFSG), which characterizes the dynamic control flow behavior, including loops and branches, of a GPGPU kernel. GPGPU-MiniBench then generates a synthetic miniature GPGPU kernel that exhibits execution characteristics similar to those of the original workload but runs for a much shorter time, thereby dramatically speeding up architectural simulation. Our experimental results show that GPGPU-MiniBench can speed up GPGPU architectural simulation by a factor of 49$\times$ on average and up to 589$\times$, with an average IPC error of 4.7 percent across a broad set of GPGPU benchmarks from the CUDA SDK, Rodinia, and Parboil benchmark suites. We also demonstrate the usefulness of GPGPU-MiniBench for driving GPU architecture exploration.

Heterogeneous computing nodes are now pervasive throughout computing, and GPUs have emerged as a leading computing device for application acceleration. GPUs have tremendous computing potential for data-parallel applications, and their emergence has led to a proliferation of GPU-accelerated applications. This proliferation has in turn led to systems in which many applications compete for access to GPU resources, so efficient utilization of those resources is critical to system performance. Prior temporal multitasking techniques can be applied to GPU resources as well, but not all GPU kernels make full use of the GPU. There is, therefore, an unmet need for spatial multitasking in GPUs: resources used inefficiently by one kernel can instead be assigned to another kernel that can use them more effectively. In this paper we propose a software-hardware solution for efficient spatial-temporal multitasking and a software-based emulation framework for our system. We pair an efficient heuristic in software with hardware leaky-bucket-based thread-block interleaving to implement spatial-temporal multitasking. We demonstrate our techniques on various GPU architectures using nine representative benchmarks from the CUDA SDK. Our experiments on the Fermi GTX480 demonstrate performance improvements of up to 46% (average 26%) over sequential GPU task execution and up to 37% (average 18%) over default concurrent multitasking. Compared with the state-of-the-art Kepler K20 using Hyper-Q technology, our technique achieves up to 40% (average 17%) performance improvement over default concurrent multitasking.

Articles published on CUDA SDK

COX: Exposing CUDA Warp-level Functions to CPUs

ICLA Unit: Intra-Cluster Locality-Aware Unit to Reduce L2 Access and NoC Pressure in GPGPUs

SCELib4.0: The new program version for computing molecular properties in the Single Center Approach

Detecting Undefined Behaviors in CUDA C

Metric Selection for GPU Kernel Classification

VOLSCAT2.0: The new version of the package for electron and positron scattering off molecular targets

DVFS-aware application classification to improve GPGPUs energy efficiency

GPGPU-MiniBench: Accelerating GPGPU Micro-Architecture Simulation

Unsafe Floating-point to Unsigned Integer Casting Check for GPU Programs

Efficient GPU Spatial-Temporal Multitasking

Branch and Data Herding: Reducing Control and Memory Divergence for Error-Tolerant GPU Applications

A Fast Implementation and Performance Analysis of Collisionless N-body Code Based on GPGPU

A Translation Framework for Executing the Sequential Binary Code on CPU/GPU Based Architectures

Parallel Colt
