Abstract

Transformer models achieve excellent results in fields such as natural language processing, computer vision, and bioinformatics. Their large number of matrix multiplications (MMs) leads to substantial data movement and computation. Although computing-in-memory (CIM) has proven to be an efficient architecture for MM computation, the transformer's attention mechanism raises new challenges in both memory access and computation: the dynamic MMs in attention layers cause redundant off-chip memory access, and attention layers dominate the transformer's computation while requiring high precision. We therefore design TranCIM, a bitline-transpose CIM-based transformer accelerator with pipeline/parallel reconfigurable modes. The pipeline mode alleviates off-chip access for attention layers, while the parallel mode serves fully connected (FC) layers with high parallelism. The fully digital CIM supports INT16 for attention layers and INT8 for FC layers, avoiding the nonideality issues of analog CIM. Moreover, a sparse attention scheduler (SAS) is proposed to reduce attention computation. The fabricated TranCIM chip consumes only 15.59 $\mu$J/token for the bidirectional encoder representations from transformers (BERT)-base model, achieving $12.08\times$–$36.82\times$ lower energy than prior CIM-based accelerators.
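As a brief note on the attention formulation underlying these challenges, transformers use the standard scaled dot-product attention, in which both MM operands are activations generated at runtime rather than static weights:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\mathsf{T}}}{\sqrt{d_k}}\right)V$$

where $Q$, $K$, and $V$ are the query, key, and value matrices computed from the layer input and $d_k$ is the key dimension. This input dependence is what makes attention MMs "dynamic": unlike FC layers, whose fixed weight operand can stay resident in CIM macros, both attention operands must be produced and moved at inference time.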
