Abstract
Systolic arrays are a core component of modern deep learning accelerators and are widely used in real-life applications such as self-driving cars. This paper presents a novel factored systolic array, in which the carry-propagate adder for accumulation and the rounding logic are extracted from each processing element, substantially reducing the area, power, and delay of the processing elements. The factoring is performed in a column-wise manner, and the cost of the factored logic, placed at each column output, is amortized across the processing elements in a column. We demonstrate the proposed factoring in an open-source systolic array, Gemmini. The factoring technique does not change the functionality of the base design and is transparent to applications. We show that the proposed technique reduces area and delay by up to 45.3% and 23.7%, respectively, compared to the Gemmini baseline.
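The column-wise factoring can be sketched in software. In this hypothetical model (names and structure are illustrative, not Gemmini's RTL), each processing element accumulates partial products in redundant carry-save form, so it needs only a cheap 3:2 compressor; the expensive carry-propagate addition and the rounding are performed once at the column output and amortized over all PEs in the column:

```python
# Illustrative sketch of column-wise factored accumulation (assumed model,
# not the paper's actual RTL): PEs keep (sum, carry) in carry-save form.

def csa(a, b, c):
    """3:2 carry-save adder: compress three addends into (sum, carry)."""
    s = a ^ b ^ c
    cy = ((a & b) | (a & c) | (b & c)) << 1
    return s, cy

def column_mac(weights, activations, shift=0):
    """Accumulate products down one column with no per-PE carry propagation."""
    s, cy = 0, 0
    for w, x in zip(weights, activations):
        s, cy = csa(s, cy, w * x)   # per-PE work: multiply + carry-save add
    total = s + cy                   # one carry-propagate add per column
    if shift:                        # round-to-nearest at the column output
        total = (total + (1 << (shift - 1))) >> shift
    return total

print(column_mac([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32
```

Because `a + b + c == (a ^ b ^ c) + 2 * majority(a, b, c)` holds bitwise, the carry-save result `(s, cy)` always satisfies `s + cy == sum of products`, which is why deferring the single carry-propagate add to the column output preserves the exact result.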
Highlights
Machine learning (ML) algorithms have acquired considerable attention after deep learning (DL) demonstrated breakthroughs in various complex tasks such as the ImageNet challenge
We present a novel factored systolic array and demonstrate it using an open-source systolic array, Gemmini (the Gemmini system-on-chip (SoC) RTL can be generated by following the lab EE-290-2, Hardware for Machine Learning, Lab-2) [17]
We built the test binaries using the bare-metal software tests provided in the Gemmini open-source repository and checked the correctness of both designs in a bare-metal environment
Summary
Machine learning (ML) algorithms have acquired considerable attention after deep learning (DL) demonstrated breakthroughs in various complex tasks such as the ImageNet challenge. The powerful ability of DL to solve complex tasks is not limited to image recognition; it is also applicable to object detection, speech recognition, natural language processing, etc. Deep learning models require massive amounts of computation and large memory footprints, and recent research has focused on DL accelerators [4]. Matrix multiplication is the key primitive in the computation of ML models, and systolic arrays (SAs) for matrix multiplication have been adopted widely [5,6]. Systolic arrays, proposed in 1979, are two-dimensional meshes that consist of processing elements (PEs) organized in the form of a grid [7,8]. Owing to their concurrency and simple architectural characteristics, many industry giants such as Google [9], Nvidia [10], and Intel [11] have adopted systolic arrays in their accelerators.
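The dataflow of a systolic array can be modeled in a few lines. The cycle-level sketch below is a generic, illustrative model of an output-stationary array (not Gemmini's specific implementation): each PE holds one element of the output matrix C, while rows of A stream in from the left and columns of B from the top with skewed timing, so PE (i, j) receives the operand pair A[i][t], B[t][j] at cycle t + i + j:

```python
# Illustrative cycle-level model of an output-stationary systolic array.
# Each PE (i, j) accumulates C[i][j]; operands arrive with a skew of i + j.

def systolic_matmul(A, B):
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0] * m for _ in range(n)]
    # The last PE (n-1, m-1) finishes at cycle (k-1) + (n-1) + (m-1).
    for cycle in range(k + n + m - 2):
        for i in range(n):
            for j in range(m):
                t = cycle - i - j        # which operand pair arrives this cycle
                if 0 <= t < k:
                    C[i][j] += A[i][t] * B[t][j]   # PE multiply-accumulate
    return C

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

The skewed schedule is what gives the array its concurrency: in any given cycle, every PE along an anti-diagonal is doing a multiply-accumulate, and operands are passed only between neighbors rather than broadcast.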