Abstract

We report our experience implementing a lattice gauge theory code on the Cell Broadband Engine (Cell/B.E.), a new heterogeneous multi-core processor. As a typical operation, we take an SU(3) matrix multiplication, which is one of the most important parts of lattice gauge theory calculations. Taking full advantage of the Cell/B.E., including SIMD operations and its many registers, which allow full use of the arithmetic units through loop unrolling, we obtain about 200 GFLOPS with 16 SPEs, which corresponds to around 80% of the theoretical peak. To our knowledge, this is the fastest value for this operation obtained on the Cell/B.E. so far. However, when we measure the whole time including the data supply, the speed drops to about 13 GFLOPS. We found that the bandwidth of the data transfer between the main memory and the EIB, 25 GB/s, is a bottleneck. In other words, it is possible to run the arithmetic units on the Cell/B.E. at 200 GFLOPS, but the current socket structure of the Cell/B.E. prevents it. We discuss several techniques that partially alleviate the problem by reducing the amount of transferred data.
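To make the bandwidth argument concrete, the following is a minimal sketch of the arithmetic behind it, not the authors' code. It counts the floating-point operations in one 3x3 complex matrix product and divides by the bytes moved, under assumptions that are ours alone: single precision, two input matrices loaded and one output matrix stored per product, and no data reuse. It also sketches one data-reduction technique that is common in lattice QCD codes, namely transferring only two rows of an SU(3) matrix and rebuilding the third as the complex conjugate of their cross product. The abstract does not say which techniques the authors use, and all function names here are hypothetical.

```python
# Rough bandwidth bound for C = A * B with 3x3 complex (SU(3)) matrices.
# Precision, traffic pattern, and function names are illustrative assumptions.

N = 3  # SU(3) matrices are 3x3 complex


def matmul3(a, b):
    """Reference 3x3 complex matrix product (plain Python, not SIMD)."""
    return [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]


def flops_per_matmul():
    # Per output element: 3 complex multiplies (6 flops each)
    # plus 2 complex additions (2 flops each).
    return N * N * (N * 6 + (N - 1) * 2)


def bytes_per_matmul(bytes_per_real=4, rows_stored=3):
    # Assumed traffic: load two matrices, store one, no reuse.
    # rows_stored=2 models transferring only two rows per matrix.
    reals_per_matrix = rows_stored * N * 2  # real and imaginary parts
    return 3 * reals_per_matrix * bytes_per_real


def bandwidth_bound_gflops(bandwidth_gb_s=25.0, **traffic):
    # Sustained rate cannot exceed bandwidth * (flops per byte moved).
    # Extra flops spent rebuilding a third row are not counted here.
    return bandwidth_gb_s * flops_per_matmul() / bytes_per_matmul(**traffic)


def reconstruct_third_row(r0, r1):
    # For a matrix in SU(3), row 3 equals conj(row 1 x row 2).
    cross = (r0[1] * r1[2] - r0[2] * r1[1],
             r0[2] * r1[0] - r0[0] * r1[2],
             r0[0] * r1[1] - r0[1] * r1[0])
    return [c.conjugate() for c in cross]


if __name__ == "__main__":
    print(flops_per_matmul(), "flops per product")
    print(bytes_per_matmul(), "bytes per product (3 rows, single precision)")
    print(round(bandwidth_bound_gflops(), 1), "GFLOPS bound at 25 GB/s")
    print(round(bandwidth_bound_gflops(rows_stored=2), 1),
          "GFLOPS bound with two rows transferred")
```

Under these assumptions one product costs 198 flops and moves 216 bytes, so a 25 GB/s link caps the sustained rate at roughly 23 GFLOPS, an order of magnitude below the 200 GFLOPS the arithmetic units can deliver. Transferring two rows instead of three cuts the traffic by a third at the cost of extra arithmetic, which is cheap when the arithmetic units are otherwise idle. This is only an upper bound under a simplified traffic model and is not meant to reproduce the measured 13 GFLOPS.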

