Abstract

Prior hardware accelerator designs have primarily focused on single-chip solutions for 10 MB-class computer vision models. GB-class transformer models for natural language processing (NLP) impose challenges on existing accelerator designs due to their massive number of parameters and the diverse matrix multiplication (MatMul) workloads involved. This work proposes a heterogeneous 3D-based accelerator design for transformer models that adopts an interposer substrate with multiple 3D memory/logic hybrid cubes, each optimized for accelerating a different MatMul workload. An approximate computing scheme is proposed to exploit the heterogeneous computing paradigms of mixed-signal compute-in-memory (CIM) and digital tensor processing units (TPUs). System-level evaluation results show an energy efficiency of 10 TOPS/W for the BERT and GPT-2 models, about 2.6×–3.1× higher than a baseline with a 7 nm TPU and stacked FeFET memory.
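To give a rough intuition for the kind of approximate, heterogeneous MatMul partitioning described above, the following minimal sketch (not the paper's actual scheme; all function names, the bit-split granularity, and the noise model are illustrative assumptions) routes the most-significant bit-slices of the weights to an exact "digital TPU" path and the least-significant slices to a noisy "mixed-signal CIM" path:

```python
# Minimal sketch, assuming a 4-bit/4-bit weight split and a simple additive
# noise model for the analog CIM path. Names and parameters are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

def split_weights(w_int8):
    """Split signed 8-bit weights into a high nibble (exact path) and a
    low nibble (approximate path)."""
    w_hi = np.right_shift(w_int8, 4) << 4   # top 4 bits, rescaled
    w_lo = w_int8 - w_hi                    # bottom 4 bits (residual)
    return w_hi, w_lo

def cim_matmul(x, w, adc_noise=0.02):
    """Model a mixed-signal CIM MatMul as an exact product plus small
    analog/ADC noise proportional to the result magnitude."""
    y = x @ w
    return y + adc_noise * np.abs(y) * rng.standard_normal(y.shape)

def tpu_matmul(x, w):
    """Digital TPU path: exact integer MatMul."""
    return x @ w

def approx_matmul(x, w_int8):
    """Heterogeneous MatMul: exact high-order slice + approximate low-order slice."""
    w_hi, w_lo = split_weights(w_int8)
    return tpu_matmul(x, w_hi) + cim_matmul(x, w_lo)

# Small usage example comparing against the exact product.
x = rng.integers(-128, 127, size=(4, 64)).astype(np.int32)
w = rng.integers(-128, 127, size=(64, 32)).astype(np.int8).astype(np.int32)

exact = x @ w
approx = approx_matmul(x, w)
rel_err = np.abs(approx - exact).mean() / np.abs(exact).mean()
print(f"mean relative error of approximate MatMul: {rel_err:.4%}")
```

The design intuition this sketch captures is that the error introduced by the analog path is bounded by the magnitude of the low-order residual it handles, so accuracy degrades gracefully while the exact digital path preserves the dominant contribution.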
