Stratix 10 NX Architecture and Applications

Martin Langhammer,Bogdan Pasca,Sergey Gribok,Eriko Nurvitadhi

doi:10.1145/3431920.3439293

Abstract

The advent of AI has driven the adoption of high density low precision arithmetic on FPGAs. This has resulted in new methods in mapping both arithmetic functions as well as dataflows onto the fabric, as well as some changes to the embedded DSP Blocks. Technologies outside of the FPGA realm have also evolved, such as the addition of tensor structures for GPUs, and also the introduction of numerous AI ASSPs, all of which have a higher claimed performance and efficiency than current FPGAs. In this paper we will introduce the Stratix 10 NX device (NX), which is a variant of FPGA specifically optimized for the AI application space. In addition to the computational capabilities of the standard programmable soft logic fabric, a new type of DSP Block provides the dense arrays of low precision multipliers typically used in AI implementations. The architecture of the block is tuned for the common matrix-matrix or vector-matrix multiplications in AI, with capabilities designed to work efficiently for both small and large matrix sizes. The base precisions are INT8 and INT4, along with shared exponent support for support block floating point FP16 and FP12 numerics. All additions/accumulations can be done in INT32 or IEEE754 single precision floating point (FP32), and multiple blocks can be cascaded together to support larger matrices. We will also describe methods by which the smaller precision multipliers can be aggregated to create larger multiplier that are more applicable to standard signal processing requirements. In terms of overall compute throughput, Stratix 10 NX achieves 143 INT8/FP16 TOPs/FLOPs, or 286 INT4/FP12 TOPS/FLOPs at 600MHz. Depending on the configuration, power efficiency is in the range of 1-4 TOPs or TFLOPs/W.

Full Text