Effectiveness of register preloading on CP-PACS node processor

Hiroshi Nakamura, Masaaki Matsubara, Taisuke Boku, Keiichi Itakura, Koyomi Nakazawa

doi:10.1109/iwia.1997.670412

Abstract

CP-PACS is a massively parallel processor (MPP) for large-scale scientific computations. In September 1996, CP-PACS, equipped with 2048 processors, began operation at the University of Tsukuba. At that time, CP-PACS was the fastest MPP in the world on the LINPACK benchmark. CP-PACS was designed to achieve very high performance in large scientific and engineering applications. It is well known that an ordinary data cache is not effective in such applications because the data size is much larger than the cache size and because there is little temporal locality. Thus, a special mechanism for hiding long memory access latency is indispensable. Cache prefetching is a well-known technique for this purpose. In addition to cache prefetching, CP-PACS node processors implement a register preloading mechanism. This mechanism enables the processor to transfer required floating-point data directly (not via the data cache) between main memory and floating-point registers in a pipelined way. We compare register preloading with cache prefetching by measuring the real performance of the CP-PACS processor and the HP PA-8000 processor, which implement cache prefetching and/or register preloading.
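The difference between the two latency-hiding techniques can be illustrated with a rough analogy in portable C. This is a sketch only, not the CP-PACS instruction set or the authors' benchmark code: the function names, the lookahead distance `DIST`, and the use of the GCC/Clang builtin `__builtin_prefetch` are all assumptions made for illustration. The first routine issues a prefetch hint so that the data arrives in the data cache ahead of use; the second loads each value into a small local window several iterations before it is consumed, mimicking a pipelined transfer toward registers.

```c
#include <stddef.h>

#define DIST 8  /* hypothetical lookahead distance, in elements */

/* Cache prefetching analogy: hint that x[i + DIST] will be needed soon.
   The data still travels through the data cache.
   __builtin_prefetch is a GCC/Clang builtin (assumed toolchain). */
double sum_with_prefetch(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&x[i + DIST], 0, 0);
        s += x[i];
    }
    return s;
}

/* Register preloading analogy: each value is loaded into a small window
   of locals DIST iterations before it is consumed, so the load for a
   later iteration overlaps the arithmetic of the current one. */
double sum_with_preload(const double *x, size_t n) {
    double window[DIST];
    double s = 0.0;
    size_t head = (n < DIST) ? n : DIST;

    /* Prologue: fill the window with the first elements. */
    for (size_t i = 0; i < head; i++)
        window[i] = x[i];

    for (size_t i = 0; i < n; i++) {
        double v = window[i % DIST];        /* value loaded DIST steps ago */
        if (i + DIST < n)
            window[i % DIST] = x[i + DIST]; /* issue the load for a later step */
        s += v;
    }
    return s;
}
```

Both routines compute the same sum; only the way memory latency is overlapped with computation differs. On a conventional processor the compiler decides whether `window` actually lives in registers, so the second routine conveys the idea of the pipelined transfer rather than reproducing the hardware mechanism.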

Full Text