Commodity CPUs Research Articles

Graphics processors (GPUs) have recently emerged as powerful coprocessors for general purpose computation. Compared with commodity CPUs, GPUs have an order of magnitude higher computation power as well as memory bandwidth. Moreover, new-generation GPUs allow writes to random memory locations, provide efficient interprocessor communication through on-chip local memory, and support a general purpose parallel programming model. Nevertheless, many of the GPU features are specialized for graphics processing, including the massively multithreaded architecture, the Single-Instruction-Multiple-Data processing style, and the execution model of a single application at a time. Additionally, GPUs rely on a bus of limited bandwidth to transfer data to and from the CPU, do not allow dynamic memory allocation from GPU kernels, and have little hardware support for write conflicts. Therefore, a careful design and implementation is required to utilize the GPU for coprocessing database queries. In this article, we present our design, implementation, and evaluation of an in-memory relational query coprocessing system, GDB, on the GPU. Taking advantage of the GPU hardware features, we design a set of highly optimized data-parallel primitives such as split and sort, and use these primitives to implement common relational query processing algorithms. Our algorithms utilize the high parallelism as well as the high memory bandwidth of the GPU, and use parallel computation and memory optimizations to effectively reduce memory stalls. Furthermore, we propose coprocessing techniques that take into account both the computation resources and the GPU-CPU data transfer cost so that each operator in a query can utilize suitable processors—the CPU, the GPU, or both—for an optimized overall performance. We have evaluated our GDB system on a machine with an Intel quad-core CPU and an NVIDIA GeForce 8800 GTX GPU. 
Our workloads include microbenchmark queries on memory-resident data as well as TPC-H queries that involve complex data types and multiple query operators on data sets larger than the GPU memory. Our results show that our GPU-based algorithms are 2-27x faster than their optimized CPU-based counterparts on in-memory data. Moreover, the performance of our coprocessing scheme is similar to, or better than, both the GPU-only and the CPU-only schemes.
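The placement idea in the abstract above, weighing compute time against GPU-CPU transfer cost for each operator, can be illustrated with a minimal sketch. This is not GDB's actual cost model: the `Operator` fields, the function names, and the bus bandwidth below are hypothetical, and the sketch assumes each operator's input starts in CPU memory and its result is copied back. The "both processors" option mentioned in the abstract is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Operator:
    """Hypothetical per-operator estimates; not GDB's real data structure."""
    name: str
    cpu_seconds: float  # estimated CPU-only execution time
    gpu_seconds: float  # estimated GPU kernel time, excluding transfers
    bytes_in: int       # input bytes copied from CPU memory to the GPU
    bytes_out: int      # result bytes copied back to CPU memory


def gpu_total_seconds(op: Operator, bus_bytes_per_sec: float) -> float:
    """GPU kernel time plus the time to move input and output over the bus."""
    transfer = (op.bytes_in + op.bytes_out) / bus_bytes_per_sec
    return op.gpu_seconds + transfer


def place(op: Operator, bus_bytes_per_sec: float) -> str:
    """Pick the processor with the lower estimated end-to-end time."""
    if gpu_total_seconds(op, bus_bytes_per_sec) < op.cpu_seconds:
        return "GPU"
    return "CPU"


# Assumed bus bandwidth of 4 GB/s, chosen only for illustration.
BUS = 4e9
scan = Operator("scan", cpu_seconds=0.010, gpu_seconds=0.002,
                bytes_in=64_000_000, bytes_out=1_000_000)
sort = Operator("sort", cpu_seconds=0.5, gpu_seconds=0.05,
                bytes_in=64_000_000, bytes_out=64_000_000)

scan_choice = place(scan, BUS)  # transfer dominates, so the CPU wins
sort_choice = place(sort, BUS)  # compute dominates, so the GPU wins
```

With these made-up numbers, the cheap scan stays on the CPU because moving its input costs more than the GPU saves, while the compute-heavy sort is sent to the GPU.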

Scientific Computing Kernels on the Cell Processor

Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, Katherine Yelick
Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720
{swwilliams,jshalf,loliker,sakamil,pjrhusbands,kayelick}@lbl.gov

ABSTRACT

The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. As a result, the high performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. In this work, we examine the potential of using the recently-released STI Cell processor as a building block for future high-end computing systems. Our work contains several novel contributions. First, we introduce a performance model for Cell and apply it to several key scientific computing kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations, and 1D/2D FFTs. The difficulty of programming Cell, which requires assembly level intrinsics for the best performance, makes this model useful as an initial step in algorithm design and evaluation. Next, we validate the accuracy of our model by comparing results against published hardware results, as well as our own implementations on a 3.2GHz Cell blade. Additionally, we compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. Our work also explores several different mappings of the kernels and demonstrates a simple and effective programming model for Cell's unique architecture. Finally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.
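To make the idea of a kernel performance model concrete, here is a minimal bound-style sketch. It is not the authors' model: it simply assumes that computation and memory transfers overlap perfectly, so the slower of the two sets the runtime. The function name is hypothetical, the 14.6 Gflop/s double-precision peak is the figure quoted for a 3.2GHz Cell, the 25.6 GB/s memory bandwidth is an assumed value supplied for illustration, and the sparse matrix vector multiply traffic estimate (2 flops and 12 bytes per nonzero, ignoring vector and row-pointer traffic) is a common simplification.

```python
def kernel_time_bound(flops, bytes_moved, peak_flops_per_sec, mem_bytes_per_sec):
    """Lower bound on kernel time assuming compute and memory fully overlap.

    Returns (seconds, limiter) where limiter is "compute" or "memory".
    """
    compute_seconds = flops / peak_flops_per_sec
    memory_seconds = bytes_moved / mem_bytes_per_sec
    if compute_seconds >= memory_seconds:
        return compute_seconds, "compute"
    return memory_seconds, "memory"


# Illustrative machine parameters (the bandwidth is an assumption).
PEAK_DP_FLOPS = 14.6e9   # double-precision peak, flop/s
MEM_BANDWIDTH = 25.6e9   # assumed memory bandwidth, bytes/s

# Double-precision sparse matrix vector multiply with nnz nonzeros:
# one multiply and one add per nonzero, an 8-byte value plus a 4-byte index.
nnz = 1_000_000
seconds, limiter = kernel_time_bound(
    flops=2 * nnz,
    bytes_moved=12 * nnz,
    peak_flops_per_sec=PEAK_DP_FLOPS,
    mem_bytes_per_sec=MEM_BANDWIDTH,
)
achievable_flops = 2 * nnz / seconds  # upper bound on the sustained flop rate
```

Under these assumptions the kernel is memory-bound, so a model of this kind would predict a sustained rate well below the arithmetic peak regardless of how well the inner loop is tuned.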
INTRODUCTION

Over the last decade the HPC community has moved towards machines composed of commodity microprocessors as a strategy for tracking the tremendous growth in processor performance in that market. As frequency scaling slows and the power requirements of these mainstream processors continue to grow, the HPC community is looking for alternative architectures that provide high performance on scientific applications, yet have a healthy market outside the scientific community. In this work, we examine the potential of the recently-released STI Cell processor as a building block for future high-end computing systems, by investigating performance across several key scientific computing kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations on regular grids, as well as 1D and 2D FFTs.

Cell combines the considerable floating point resources required for demanding numerical algorithms with a power-efficient software-controlled memory hierarchy. Despite its radical departure from previous mainstream/commodity processor designs, Cell is particularly compelling because it will be produced at such high volumes that it will be cost-competitive with commodity CPUs. The current implementation of Cell is most often noted for its extremely high performance single-precision arithmetic, which is widely considered insufficient for the majority of scientific applications. Although Cell's peak double precision performance is still impressive relative to its commodity peers (~14.6 Gflop/s @ 3.2GHz), we explore how modest hardware changes could significantly improve performance for computationally intensive double precision applications.

This paper presents several novel results and expands our previous efforts [37]. We present quantitative performance data for scientific kernels that compares Cell performance to leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures.
We believe this study examines the broadest array of scientific algorithms to date on Cell. We developed both analytical models and lightweight simulators to predict kernel performance that we demonstrated to be accurate when compared against published Cell hardware results, as well as our own implementations on a 3.2GHz Cell blade. Our work also explores the complexity of mapping several important scientific algorithms onto the Cell's unique architecture in order to leverage the large number of available functional units and the software-controlled memory. Additionally, we propose modest microarchitectural modifications that would increase the efficiency of double-precision arithmetic calculations compared with the current Cell implementation.

Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency. We exploit Cell's heterogeneity not in computation, but in control and system support. Thus we conclude that Cell's heterogeneous multi-core implementation is inherently better suited to the HPC environment than homogeneous commodity multicore

Articles published on Commodity CPUs

ReuseTracker : Fast Yet Accurate Multicore Reuse Distance Analyzer

Accel-Align: a fast sequence mapper and aligner based on the seed–embed–extend method

Mining Discriminative K-Mers in DNA Sequences Using Sketches and Hardware Acceleration

GAN Dissection and Datacenter RPCs

FlexSaaS

Featherlight on-the-fly false-sharing detection

Parallel Rendering for Legible Illustrative Visualizations of Dense Geometries on Commodity CPUs

RIFFA 2.1

To lock, swap, or elide

Ziria

Fast Exact ILP Decompositions for Ring RWA

Relational query coprocessing on graphics processors

Cluster versus grid for operational generation of ATCOR’s MODTRAN-based look up tables

Scientific Computing Kernels on the Cell Processor
