Acceleration techniques are central to modern high-speed computation, especially in Deep Learning (DL) applications, where speed is critical. A key component in this context is the Systolic Array (SA), which performs computation and moves data in a rhythmic, pipelined manner. Google's Tensor Processing Unit (TPU) uses an SA to accelerate neural networks. The functionality and performance of the SA core depend on the Computation Element (CE), which enables parallel data flow. In this article, we introduce a novel design, the Proposed Systolic Array (PSA), which is implemented on the CE and further enhanced with a modified Hybrid Kogge-Stone adder (MHA). The resulting design, PSA-MHA, incorporates rounding and data extraction in the SA data model to expedite computation. Using a data flow model with the MHA, the PSA significantly accelerates data shifts and control passes across execution cycles. We validated the approach through simulations on the Cadence Virtuoso platform in 65 nm process technology, comparing it against the General Matrix Multiplication (GMMN) benchmark. The CE showed a 30.29 % reduction in delay, a 23.07 % reduction in area, and an 11.87 % reduction in power consumption. The PSA as a whole achieved a 46.38 % reduction in delay, a 7.58 % reduction in area, and a 48.23 % decrease in Area Delay Product (ADP). To further substantiate these findings, we applied the PSA-based approach to pre-trained hybrid Convolutional and Recurrent Neural Network (CNN-RNN) models. The PSA-based hybrid model incorporates 189 million Multiply-Accumulate (MAC) units, resulting in a weighted mean architecture value of 784.80 for the RNN component.
We also explored variations in bit width, which led to delay reductions ranging from 20.17 % to 30.29 %, area variations between 13.08 % and 32.16 %, and power consumption changes spanning from 11.88 % to 20.42 %.
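For readers unfamiliar with the two building blocks named above, the following is a minimal behavioural sketch in Python, not the PSA or MHA design itself. It assumes a textbook output-stationary systolic array and a standard (unmodified) Kogge-Stone parallel-prefix adder, with unsigned operands that wrap at `width` bits. The function names `kogge_stone_add` and `systolic_matmul` and the default widths are hypothetical and chosen only for illustration.

```python
def kogge_stone_add(a, b, width=16):
    """Add two unsigned integers with a Kogge-Stone parallel-prefix carry network.

    Textbook adder (assumption), not the paper's modified hybrid variant.
    The result wraps modulo 2**width.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    g = a & b            # per-bit generate
    p = a ^ b            # per-bit propagate
    half_sum = p         # keep the bitwise half-sum for the final XOR
    d = 1
    while d < width:     # log2(width) prefix stages
        g = g | (p & (g << d))   # group generate over a span of 2*d bits
        p = p & (p << d)         # group propagate over a span of 2*d bits
        d <<= 1
    carry = (g << 1) & mask      # carry into bit i is the group generate of bit i-1
    return (half_sum ^ carry) & mask


def systolic_matmul(A, B, width=32):
    """Simulate an output-stationary systolic array computing A @ B.

    Element (i, j) of the array owns one accumulator. Inputs are skewed so that
    at a given cycle, element (i, j) sees operand index t = cycle - i - j.
    Entries are assumed to be non-negative integers.
    """
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]
    for cycle in range(k + n + m - 2):       # cycles until the last element finishes
        for i in range(n):
            for j in range(m):
                t = cycle - i - j
                if 0 <= t < k:               # operands have reached this element
                    product = A[i][t] * B[t][j]
                    acc[i][j] = kogge_stone_add(acc[i][j], product, width)
    return acc


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The sketch models only the order of operations: each array element performs one multiply-accumulate per cycle once its skewed operands arrive, and the adder resolves carries in a logarithmic number of prefix stages. It says nothing about the delay, area, or power figures reported above, which come from the authors' circuit-level simulations.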