Abstract

Transformer-based models have achieved tremendous success in many artificial intelligence (AI) tasks, outperforming conventional convolutional neural networks (CNNs) from natural language processing (NLP) to computer vision (CV). Their success relies on the self-attention mechanism, which provides a global rather than local receptive field, unlike CNNs. Despite its superiority, global-level self-attention consumes $\sim 100\times$ more operations than CNNs and cannot be handled effectively by existing CNN processors because of its distinct operations, creating an urgent need for a dedicated Transformer processor. However, global self-attention involves massive, naturally occurring weakly related tokens (WR-Tokens) arising from the redundant content in human languages and images. These WR-Tokens generate zero and near-zero attention results that introduce an energy consumption bottleneck, redundant computations, and hardware under-utilization, making energy-efficient self-attention computing challenging. This article proposes a Transformer processor that effectively handles the WR-Tokens to solve these challenges. First, a big-exact-small-approximate processing element (PE) reduces multiply-and-accumulate (MAC) energy for WR-Tokens by adaptively computing small values approximately while computing large values exactly. Second, a bidirectional asymptotical speculation unit captures and removes the redundant computations of zero attention outputs by exploiting the local property of self-attention. Third, an out-of-order PE-line computing scheduler improves hardware utilization for near-zero values by reordering operands to dovetail two operations into one multiplication. Fabricated in a 28-nm CMOS technology, the proposed processor occupies an area of 6.82 mm$^2$. When evaluated with 90% approximate computing on the generative pre-trained transformer 2 (GPT-2) model, the peak energy efficiency is 27.56 TOPS/W at 0.56 V and 50 MHz, $17.66\times$ higher than the A100 graphics processing unit (GPU). Compared with the state-of-the-art Transformer processor, it reduces energy by $4.57\times$ and offers a $3.73\times$ speedup.
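
A minimal software sketch may help illustrate the big-exact-small-approximate MAC idea described above. This is only a conceptual model in Python: the magnitude threshold, the bit-truncation approximation, and the function names are hypothetical stand-ins, not the processor's actual circuit-level scheme.

```python
import numpy as np

def approx_mul(a, b, drop_bits=4):
    """Hypothetical approximate multiply: truncate low-order bits of
    both operands before multiplying, trading accuracy for energy."""
    a_t = (int(a) >> drop_bits) << drop_bits
    b_t = (int(b) >> drop_bits) << drop_bits
    return a_t * b_t

def big_exact_small_approx_mac(acts, weights, threshold=16):
    """Conceptual model of the big-exact-small-approximate MAC:
    small-magnitude operand pairs (an assumed WR-Token criterion)
    are multiplied approximately, larger ones exactly."""
    acc = 0
    for a, w in zip(acts, weights):
        if abs(int(a)) < threshold and abs(int(w)) < threshold:
            acc += approx_mul(abs(int(a)), abs(int(w))) * np.sign(a) * np.sign(w)
        else:
            acc += int(a) * int(w)
    return int(acc)

# Example with int8-style operand vectors
acts = np.array([120, 3, -5, 90], dtype=np.int8)
weights = np.array([7, 2, 4, -60], dtype=np.int8)
print(big_exact_small_approx_mac(acts, weights))
```

In this sketch, large values dominate the accumulated result and are kept exact, while small values, which contribute little to the attention output, tolerate the cheaper approximate path.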
