Transformers have achieved great success in a wide variety of natural language processing (NLP) tasks due to the self-attention mechanism, which assigns an importance score to every word relative to the other words in a sequence. However, these models are very large, often reaching hundreds of billions of parameters, and therefore require a large number of dynamic random access memory (DRAM) accesses. Hence, traditional deep neural network (DNN) accelerators such as graphics processing units (GPUs) and tensor processing units (TPUs) face limitations in processing Transformers efficiently. In-memory accelerators based on nonvolatile memory (NVM) promise to be an effective solution to this challenge, since they provide high storage density while performing massively parallel matrix–vector multiplications (MVMs) within memory arrays. However, attention score computations, which are used frequently in Transformers, unlike in convolutional neural networks (CNNs) and recurrent neural networks (RNNs), require MVMs in which both operands change dynamically for each input. As a result, conventional NVM-based accelerators incur high write latency and write energy when used for Transformers, and they further suffer from the low endurance of most NVM technologies. To address these challenges, we present a hybrid in-memory hardware accelerator that consists of both NVM and CMOS processing elements to execute Transformer workloads efficiently. To improve its hardware utilization, we also propose a sequence blocking dataflow, which overlaps the computations of the two processing elements and reduces execution time.
Across several benchmarks, we show that the proposed accelerator achieves up to 69.8× and 13× improvements in latency and energy, respectively, over an NVIDIA GeForce GTX 1060 GPU, and up to 24.1× and 7.95× improvements in latency and energy, respectively, over a state-of-the-art in-memory NVM accelerator.
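
To illustrate why attention scores are awkward for NVM arrays, the following minimal Python sketch contrasts the two kinds of MVM operand. It is not the accelerator's implementation: the function names, the 2×2 weights, and the toy sequences are hypothetical, and softmax and scaling are omitted. The projection weights are fixed after training and could be written to memory once, whereas the key matrix is recomputed from every input sequence, so an array holding it would have to be rewritten per input.

```python
# Illustrative sketch only: all names and values are hypothetical,
# chosen to show static versus dynamic MVM operands.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def project(tokens, weight):
    """Apply one static weight matrix to every token (one MVM per token)."""
    return [matvec(weight, token) for token in tokens]

def attention_scores(queries, keys):
    """Unnormalized scores: scores[i][j] = dot(queries[i], keys[j])."""
    return [matvec(keys, q) for q in queries]

# Static operands: fixed after training, so they need to be stored only once.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0]]

def scores_for(tokens):
    queries = project(tokens, W_Q)
    # Dynamic operand: the key matrix is derived from the input itself,
    # so the "matrix" side of the score MVM changes with every sequence.
    keys = project(tokens, W_K)
    return attention_scores(queries, keys)

seq_a = [[1.0, 2.0], [3.0, 4.0]]
seq_b = [[0.0, 1.0], [1.0, 0.0]]

print(scores_for(seq_a))
print(scores_for(seq_b))
```

In this sketch `W_Q` and `W_K` are reused unchanged for both sequences, while `keys` differs between `seq_a` and `seq_b`; it is this second kind of operand that would force repeated writes in an NVM-only design.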