Abstract
With the growing popularity of Transformer neural networks, hardware accelerators must increasingly support nonlinear computation, chief among which is the softmax operation. However, striking a good balance between algorithm performance and hardware overhead remains a persistent challenge. Hence, this brief proposes TEA-S, a tiny and efficient architecture that implements the softmax function using a universal approximation scheme based on Piecewise Linear Approximation Computation (PLAC). Through the first co-optimization of computation and memory, TEA-S achieves the design goals of tiny area and high efficiency. Implementation results show a peak efficiency of 487.51 Gps/(mm²·mW) when processing 8-bit quantized data, with a tiny area of 3052.43 μm² at a frequency of 0.5 GHz under 90-nm CMOS technology. Moreover, TEA-S offers a universal solution for input sequences of any length, incurring negligible accuracy loss in Transformers compared with the quantized baselines.
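The brief itself contains no code; the following minimal Python sketch only illustrates the general idea of a piecewise-linear softmax approximation. It is not the authors' PLAC or TEA-S design, and the segment count, input range, and breakpoints are illustrative assumptions.

```python
import numpy as np

# Sketch of a piecewise-linear softmax approximation (assumptions only;
# not the TEA-S/PLAC architecture described in the brief).

def pwl_exp(x, num_segments=8, lo=-8.0, hi=0.0):
    """Approximate exp(x) on [lo, hi] with uniform linear segments.

    After max-subtraction, softmax inputs satisfy x <= 0, so exp(x) <= 1
    and a small table of slopes/intercepts suffices.
    """
    # Precompute per-segment slope and intercept from exact endpoints.
    xs = np.linspace(lo, hi, num_segments + 1)
    ys = np.exp(xs)
    slopes = (ys[1:] - ys[:-1]) / (xs[1:] - xs[:-1])
    intercepts = ys[:-1] - slopes * xs[:-1]

    x = np.clip(x, lo, hi)
    # Map each input to its segment index.
    idx = np.minimum(((x - lo) / (hi - lo) * num_segments).astype(int),
                     num_segments - 1)
    return slopes[idx] * x + intercepts[idx]

def pwl_softmax(logits):
    """Softmax using the piecewise-linear exp approximation."""
    z = logits - logits.max()   # max-subtraction for numerical range control
    e = pwl_exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1, -1.5])
print(pwl_softmax(logits))
# Exact reference for comparison:
print(np.exp(logits - logits.max()) / np.exp(logits - logits.max()).sum())
```

In a hardware setting, the slope/intercept table would typically be precomputed and quantized, which is where a calculation-memory co-optimization such as the one claimed for TEA-S would apply.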