Abstract

In this article, a new multiple-mode floating-point fused multiply–add (FMA) unit is proposed for deep learning processors. Based on the practical requirements of deep learning applications, the proposed design supports three functional modes (normal FMA, mixed FMA, and dual FMA) and four precisions: single-precision (SP), half-precision (HP), BFloat16 (BF16), and TensorFloat-32 (TF32). In the normal FMA mode, conventional FMA operations are performed every clock cycle: either one SP operation or two parallel HP operations. The mixed FMA mode and the dual FMA mode implement mixed-precision operations, namely a fused multiply–accumulate and a dot product, respectively; in both cases, the product of a lower-precision multiplication is accumulated into a higher-precision addend. Compared with the mixed FMA mode, the dual FMA mode doubles throughput by fully utilizing the multiplier operand bandwidth. In addition to FMA operations, numerical precision conversion (NPCvt) is also supported: higher-precision FMA results can be converted into lower-precision numbers, corresponding to the datatype conversions in the datapath of deep neural network (DNN) training. The proposed FMA design uses both segmentation and hardware-reuse methods to trade off performance (throughput and latency) against area and power. Compared with state-of-the-art multiple-precision FMA units, the proposed design supports more types of floating-point operations as well as NPCvt, with higher throughput and lower hardware overhead.
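
The mixed-precision semantics summarized above can be illustrated with a short software model. The sketch below is not the hardware design itself but a minimal NumPy emulation, assuming HP operands and an SP accumulator (BF16 and TF32 are omitted because NumPy has no native support for them): mixed FMA accumulates one lower-precision product into a higher-precision addend, dual FMA accumulates two such products per issue, and NPCvt rounds the higher-precision result back down. The function names mixed_fma, dual_fma, and npcvt are illustrative and not taken from the article.

import numpy as np

def mixed_fma(a_hp, b_hp, c_sp):
    # Mixed FMA mode (illustrative): one HP x HP product accumulated into
    # an SP addend.  Promoting the operands to SP before multiplying keeps
    # the product unrounded until the single SP accumulation, mimicking the
    # fused behaviour described in the abstract.
    return np.float32(a_hp) * np.float32(b_hp) + np.float32(c_sp)

def dual_fma(a0_hp, b0_hp, a1_hp, b1_hp, c_sp):
    # Dual FMA mode (illustrative): a two-element dot product of HP pairs
    # accumulated into a single SP addend, doubling throughput per issue
    # relative to the mixed FMA mode.
    p0 = np.float32(a0_hp) * np.float32(b0_hp)
    p1 = np.float32(a1_hp) * np.float32(b1_hp)
    return p0 + p1 + np.float32(c_sp)

def npcvt(x_sp):
    # Numerical precision conversion (NPCvt), illustrative: round an SP
    # FMA result back down to HP, as done between stages of mixed-precision
    # DNN training.
    return np.float16(x_sp)

# Usage: accumulate two HP products into an SP partial sum, then convert
# the result back to HP for a subsequent lower-precision operation.
acc = dual_fma(np.float16(0.1), np.float16(3.0),
               np.float16(0.2), np.float16(4.0),
               np.float32(1.0))
print(acc, npcvt(acc))

In the hardware unit, as the abstract notes, the dual FMA mode achieves this doubled throughput by fully utilizing the multiplier operand bandwidth rather than by issuing two separate instructions.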
