Scaling up next generation supercomputers

Valentina Salapura

doi:10.1145/1366230.1366232

Abstract

Historically, technology has been the main driver of computer performance. For many system generations, CMOS scaling has been leveraged to increase clock speed and build increasingly complex microarchitectures. As technology-driven performance gains are becoming increasingly harder to achieve, innovative system architecture must step in. In the context of the design of the Blue Gene/P supercomputer chip, we will discuss how we adopted a holistic approach to optimization of the entire hardware and software stack for a range of metrics: performance, power, power/performance, reliability and ease of use.The new Blue Gene/P chip multiprocessor (CMP) scales node performance using a multi-core system-on-a-chip design. While in the past large symmetric multi processor (SMP) designs were sized to handle large amounts of coherence traffic, many modern CMP designs find this cost prohibitive in terms of area, power dissipation, and design complexity. As multi-core processors evolve to larger configurations, the performance loss due to handling coherence traffic must be carefully managed. Thus, to ensure high efficiency of each quad-processor node in Blue Gene/P, taming the cost of coherence of traditional SMP designs was a key requirement.The new Blue Gene/P chip multiprocessor exploits a novel way of reducing coherence cost by filtering useless coherence actions. Each processor core is paired with a snoop filter which identifies and discards unnecessary coherence requests before they can reach the processor cores. Removing unnecessary lookups reduces the interference of invalidate requests with L1 data cache accesses, and reduces power by eliminating expensive tag array accesses. This approach results in improved power and performance characteristics.To optimize application performance, we exploit parallelism at multiple levels: at the process-level, thread-level, data-level, and instruction-level. Hardware supported coherence allows applications to efficiently share data between threads on different processors for thread-level parallelism, while the dual floating point unit and the dual-issue out-of-order PowerPC450 processor core exploit data and instruction level parallelism, respectively. To exploit process-level parallelism, special emphasis was put on efficient communication primitives by including hardware support for the MPI protocol, such as low latency barriers, and five highly optimized communication networks. A new high performance DMA unit supports high throughput data transfers.As the result of this deliberate design for scalability approach, Blue Gene supercomputers offer unprecedented scalability, in some cases by orders of magnitude, to a wide range of scientific applications. A broad range of scientific applications on Blue Gene supercomputers have advanced scientific discovery, which is the real merit and ultimate measure of success of the Blue Gene system family.

Full Text