Deep learning applications have become ubiquitous, driving rapid development of machine learning (ML) accelerators. Systolic arrays are a central component of ML accelerator architectures. To fully leverage systolic arrays, the computer-arithmetic data-path components and their tradeoffs in accelerators must be explored. We present a novel factored systolic array (FSA) architecture, in which a carry propagation adder (CPA) and a carry-save adder (CSA) perform hybrid accumulation on the least significant bits (LSBs) and most significant bits (MSBs), respectively, inside each processing element. In addition, a small CPA that completes the accumulation of the MSBs, along with rounding logic, is placed in each column of the array; this not only reduces area, delay, and power but also balances the tradeoff between combinational and sequential area. We demonstrate the hybrid accumulator with partial CPA factoring in "Gemmini," an open-source practical systolic array accelerator; the factoring technique does not change the functionality of the base design. We implemented three baselines, the original Gemmini and two variants of it, and show that the proposed approach yields significant overall reductions in area (12.8%–50.2%) and power (18.6%–41%) with similar or improved delay relative to the baselines.
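A minimal Python sketch of the accumulation scheme the abstract describes: the low k bits are accumulated with a conventional (CPA-style) addition each cycle, while the high bits are kept in redundant carry-save (sum, carry) form and resolved only once at the end by a small CPA. This is an illustrative bit-level model, not the paper's RTL; the split point `k`, the function names, and the omission of the per-column rounding logic are all assumptions made for clarity.

```python
def csa(s, c, x):
    # One 3:2 carry-save compression step (bitwise, arbitrary width):
    # preserves the invariant s + c + x == new_s + new_c.
    return s ^ c ^ x, ((s & c) | (s & x) | (c & x)) << 1


def hybrid_accumulate(values, k=8):
    """Accumulate nonnegative integers.

    Low k bits: resolved immediately with ordinary addition (models the
    per-PE CPA on the LSBs). High bits: held as a redundant (sum, carry)
    pair (models the per-PE CSA on the MSBs); a single final addition
    models the small per-column CPA that completes the accumulation.
    """
    mask = (1 << k) - 1
    lsb_acc = 0          # resolved LSB accumulator (CPA each step)
    s = c = 0            # MSB accumulator in carry-save form
    for v in values:
        lsb_acc += v & mask
        carry_out = lsb_acc >> k   # overflow from the LSB path, 0 or 1
        lsb_acc &= mask
        # Fold the MSBs of v and the LSB carry-out into the
        # carry-save accumulator without any carry propagation.
        s, c = csa(s, c, v >> k)
        s, c = csa(s, c, carry_out)
    # Final small CPA resolves the redundant MSB accumulator.
    return ((s + c) << k) + lsb_acc
```

Because the carry-save pair is only collapsed once, each processing-element step avoids a full-width carry chain, which is the source of the area, delay, and power savings the abstract reports.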