Memory Stalls Research Articles

Graphics processors (GPUs) have recently emerged as powerful coprocessors for general purpose computation. Compared with commodity CPUs, GPUs have an order of magnitude higher computation power as well as memory bandwidth. Moreover, new-generation GPUs allow writes to random memory locations, provide efficient interprocessor communication through on-chip local memory, and support a general purpose parallel programming model. Nevertheless, many of the GPU features are specialized for graphics processing, including the massively multithreaded architecture, the Single-Instruction-Multiple-Data processing style, and the execution model of a single application at a time. Additionally, GPUs rely on a bus of limited bandwidth to transfer data to and from the CPU, do not allow dynamic memory allocation from GPU kernels, and have little hardware support for write conflicts. Therefore, a careful design and implementation is required to utilize the GPU for coprocessing database queries. In this article, we present our design, implementation, and evaluation of an in-memory relational query coprocessing system, GDB, on the GPU. Taking advantage of the GPU hardware features, we design a set of highly optimized data-parallel primitives such as split and sort, and use these primitives to implement common relational query processing algorithms. Our algorithms utilize the high parallelism as well as the high memory bandwidth of the GPU, and use parallel computation and memory optimizations to effectively reduce memory stalls. Furthermore, we propose coprocessing techniques that take into account both the computation resources and the GPU-CPU data transfer cost so that each operator in a query can utilize suitable processors—the CPU, the GPU, or both—for an optimized overall performance. We have evaluated our GDB system on a machine with an Intel quad-core CPU and an NVIDIA GeForce 8800 GTX GPU. 
Our workloads include microbenchmark queries on memory-resident data as well as TPC-H queries that involve complex data types and multiple query operators on data sets larger than the GPU memory. Our results show that our GPU-based algorithms are 2-27x faster than their optimized CPU-based counterparts on in-memory data. Moreover, the performance of our coprocessing scheme is similar to, or better than, both the GPU-only and the CPU-only schemes.
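
The abstract names split as one of the data-parallel primitives but does not define it. The following is a minimal sequential sketch of one common way to build such a primitive (histogram, exclusive prefix sum, scatter), written in Python for readability. It is not GDB's implementation; the function name `split`, its parameters, and the `partition_of` callback are assumptions made for illustration. Computing every output offset before any write is one way to sidestep the write conflicts the abstract mentions, since each tuple then has a unique destination.

```python
from itertools import accumulate

def split(tuples, num_partitions, partition_of):
    """Group tuples by partition id, keeping input order within each partition.

    Hypothetical helper for illustration only; not taken from GDB.
    Returns the reordered list and the start offset of each partition.
    """
    # Step 1: histogram. Count how many tuples fall into each partition.
    counts = [0] * num_partitions
    for t in tuples:
        counts[partition_of(t)] += 1

    # Step 2: exclusive prefix sum. offsets[p] is where partition p starts.
    offsets = [0] + list(accumulate(counts))[:-1]

    # Step 3: scatter. Each tuple is written to the next free slot of its
    # partition, so no two tuples ever target the same output position.
    out = [None] * len(tuples)
    cursor = offsets.copy()
    for t in tuples:
        p = partition_of(t)
        out[cursor[p]] = t
        cursor[p] += 1
    return out, offsets

# Usage: split six integers into even (partition 0) and odd (partition 1).
result, starts = split([5, 2, 7, 4, 1, 8], 2, lambda x: x % 2)
print(result, starts)
```

On a GPU the three steps would typically run as separate parallel passes; the sequential loops here only show the data flow.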


This paper describes an experimental evaluation of the prototype Imagine stream processor. Imagine [Imagine: Media processing with streams] is a stream processor that employs a two-level register hierarchy with 9.7 Kbytes of local register file capacity and 128 Kbytes of stream register file (SRF) capacity to capture producer-consumer locality in stream applications. Parallelism is exploited using an array of 48 floating-point arithmetic units organized as eight SIMD clusters with a 6-wide VLIW per cluster. We evaluate the performance of each aspect of the Imagine architecture using a set of synthetic micro-benchmarks, key media processing kernels, and full applications. These micro-benchmarks show that the prototype hardware can attain 7.96 GFLOPS or 25.4 GOPS of arithmetic performance, 12.7 Gbytes/s of SRF bandwidth, 1.58 Gbytes/s of memory system bandwidth, and accept up to 2 million stream processor instructions per second from a host processor. On a set of media processing kernels, Imagine sustained an average of 43% of peak arithmetic performance. An evaluation of full applications provides a breakdown of where execution time is spent. Over full applications, Imagine achieves 39.4% of peak performance; of the remainder, on average 36.4% of time is lost due to load imbalance between arithmetic units in the VLIW clusters and limited instruction-level parallelism within kernel inner loops, 10.6% is due to kernel startup and shutdown overhead because of short stream lengths, 7.6% is due to memory stalls, and the rest is due to insufficient host processor bandwidth. Further analysis included in the paper presents the impact of host instruction bandwidth on application performance, particularly on smaller datasets. In summary, the experimental measurements described in this paper demonstrate the high performance and efficiency of stream processing: operating at 200 MHz, Imagine sustains 4.81 GFLOPS on QR decomposition while dissipating 7.42 Watts.
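
The figures quoted in this abstract allow a quick back-of-the-envelope check. The sketch below uses only numbers stated above; the variable names are illustrative, and the derived values (the unstated "rest" of the time breakdown, and GFLOPS per Watt on QR decomposition) are simple arithmetic on the abstract's figures, not results reported by the paper.

```python
# Execution-time breakdown over full applications, in percent (from the abstract).
useful = 39.4          # share of peak performance achieved
load_imbalance = 36.4  # load imbalance and limited ILP in kernel inner loops
kernel_overhead = 10.6 # kernel startup and shutdown on short streams
memory_stalls = 7.6    # memory stalls

# "The rest" is attributed to insufficient host processor bandwidth.
host_bandwidth = round(
    100.0 - (useful + load_imbalance + kernel_overhead + memory_stalls), 1
)

# QR decomposition figures (from the abstract).
qr_gflops = 4.81
qr_watts = 7.42
microbench_gflops = 7.96  # arithmetic rate attained on micro-benchmarks

gflops_per_watt = qr_gflops / qr_watts
fraction_of_microbench = qr_gflops / microbench_gflops

print(f"host bandwidth share: {host_bandwidth}%")
print(f"QR efficiency: {gflops_per_watt:.2f} GFLOPS/W")
print(f"QR vs. micro-benchmark rate: {fraction_of_microbench:.1%}")
```

Under these figures the unstated remainder comes to 6.0% of execution time, and QR decomposition runs at roughly 0.65 GFLOPS per Watt, or about 60% of the micro-benchmark arithmetic rate.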


Articles published on Memory Stalls

ThunderRW

Improving execution efficiency of just-in-time compilation based query processing on GPUs

HAWS

Exploiting architectural features of a computer vision platform towards reducing memory stalls

Many-core needs fine-grained scheduling: A case study of query processing on Intel Xeon Phi processors

TC-Release++: An Efficient Timestamp-Based Coherence Protocol for Many-Core Architectures

Adaptive Runtime-Assisted Block Prefetching on Chip-Multiprocessors

In-cache query co-processing on coupled CPU-GPU architectures

ADDICT

Profiling R on a contemporary processor

PICA

Multithreading in Java: Performance and Scalability on Multicore Systems

Relational query coprocessing on graphics processors

Evaluating the Imagine Stream Architecture

WHERE DOES THE SPEEDUP GO: QUANTITATIVE MODELING OF PERFORMANCE LOSSES IN SHARED-MEMORY PROGRAMS
