Scientific Computing Kernels on the Cell Processor

Samuel Williams, John Shalf, Leonid Oliker, Shoaib Kamil, Parry Husbands, Katherine Yelick
Computational Research Division
Lawrence Berkeley National Laboratory
Berkeley, CA 94720
{swwilliams,jshalf,loliker,sakamil,pjrhusbands,kayelick}@lbl.gov

ABSTRACT

The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. As a result, the high performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. In this work, we examine the potential of using the recently released STI Cell processor as a building block for future high-end computing systems. Our work contains several novel contributions. First, we introduce a performance model for Cell and apply it to several key scientific computing kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations, and 1D/2D FFTs. The difficulty of programming Cell, which requires assembly-level intrinsics for the best performance, makes this model useful as an initial step in algorithm design and evaluation. Next, we validate the accuracy of our model by comparing results against published hardware results, as well as our own implementations on a 3.2 GHz Cell blade. Additionally, we compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. Our work also explores several different mappings of the kernels and demonstrates a simple and effective programming model for Cell's unique architecture. Finally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations.
Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.

INTRODUCTION

Over the last decade the HPC community has moved towards machines composed of commodity microprocessors as a strategy for tracking the tremendous growth in processor performance in that market. As frequency scaling slows and the power requirements of these mainstream processors continue to grow, the HPC community is looking for alternative architectures that provide high performance on scientific applications, yet have a healthy market outside the scientific community. In this work, we examine the potential of the recently released STI Cell processor as a building block for future high-end computing systems, by investigating performance across several key scientific computing kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations on regular grids, as well as 1D and 2D FFTs.

Cell combines the considerable floating point resources required for demanding numerical algorithms with a power-efficient software-controlled memory hierarchy. Despite its radical departure from previous mainstream/commodity processor designs, Cell is particularly compelling because it will be produced at such high volumes that it will be cost-competitive with commodity CPUs. The current implementation of Cell is most often noted for its extremely high performance single-precision arithmetic, which is widely considered insufficient for the majority of scientific applications. Although Cell's peak double precision performance is still impressive relative to its commodity peers (~14.6 Gflop/s @ 3.2 GHz), we explore how modest hardware changes could significantly improve performance for computationally intensive double precision applications.

This paper presents several novel results and expands our previous efforts [37].
We present quantitative performance data for scientific kernels that compares Cell performance to leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. We believe this study examines the broadest array of scientific algorithms to date on Cell. We developed both analytical models and lightweight simulators to predict kernel performance that we demonstrated to be accurate when compared against published Cell hardware results, as well as our own implementations on a 3.2 GHz Cell blade. Our work also explores the complexity of mapping several important scientific algorithms onto the Cell's unique architecture in order to leverage the large number of available functional units and the software-controlled memory. Additionally, we propose modest microarchitectural modifications that would increase the efficiency of double-precision arithmetic calculations compared with the current Cell implementation. Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency. We exploit Cell's heterogeneity not in computation, but in control and system support. Thus we conclude that Cell's heterogeneous multi-core implementation is inherently better suited to the HPC environment than homogeneous commodity multicore