PEACH2

Yuetsu Kodama, Toshihiro Hanawa, Taisuke Boku, Mitsuhisa Sato

doi:10.1145/2693714.2693716

Abstract

In recent years, heterogeneous clusters equipped with accelerators have often been used as high performance computing systems. In such clusters, inter-node communication between accelerators requires several memory copies via CPU memory, and the resulting communication latency severely reduces performance. To solve this problem, we have proposed the Tightly Coupled Accelerators (TCA) architecture, which is intended to reduce the communication latency between accelerators on different nodes. In the TCA architecture, PCI Express packets are used for communication among GPUs across nodes. To implement the TCA architecture, we developed a communication chip named PEACH2. In this paper, we describe the details of the design and implementation of the PEACH2 chip using an FPGA, with respect to its routing mechanism and its DMA controller. We evaluated PEACH2 on a new platform that uses the latest Xeon CPU, Ivy Bridge, and achieved 2.3 GBytes/sec between GPUs across nodes, whereas the previous platform with Sandy Bridge achieved only 880 MBytes/sec.

Full Text