Abstract

Neural networks (NNs) are growing in importance and complexity. An NN’s performance (and energy efficiency) can be bound by either computation or memory resources. The processing-in-memory (PIM) paradigm, where computation is placed near or within memory arrays, is a viable solution to accelerate memory-bound NNs. However, PIM architectures vary in form, and different PIM approaches lead to different tradeoffs. Our goal is to analyze, discuss, and contrast dynamic random-access memory (DRAM)-based PIM architectures for NN performance and energy efficiency. To do so, we analyze three state-of-the-art PIM architectures: 1) UPMEM, which integrates processors and DRAM arrays into a single 2-D chip, 2) Mensa, a 3-D-stacking-based PIM architecture tailored for edge devices, and 3) SIMDRAM, which uses the analog principles of DRAM to execute bit-serial operations. Our analysis reveals that PIM greatly benefits memory-bound NNs: 1) UPMEM provides 23× the performance of a high-end graphics processing unit (GPU) when the GPU requires memory oversubscription for a general matrix–vector multiplication (GEMV) kernel, 2) Mensa improves energy efficiency and throughput by 3.0× and 3.1× over the baseline Edge tensor processing unit for 24 Google edge NN models, and 3) SIMDRAM outperforms a central processing unit (CPU) and GPU by 16.7× and 1.4×, respectively, for three binary NNs. We conclude that the ideal PIM architecture for an NN model depends on the model's distinct attributes, due to the inherent design choices of each PIM architecture.
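For context, the GEMV kernel referenced above is typically memory-bound because each matrix element is loaded once and used in only a single multiply–add. The sketch below is an illustrative plain-C version (function name and layout are our own, not taken from the evaluated systems), intended only to show why the kernel's arithmetic intensity is low:

```c
#include <stddef.h>

/* Illustrative GEMV: y = A * x, with A an m x n row-major matrix.
 * Each element of A is read exactly once and used in one multiply-add,
 * so the kernel tends to be limited by memory bandwidth, not compute. */
void gemv(const float *A, const float *x, float *y, size_t m, size_t n)
{
    for (size_t i = 0; i < m; i++) {
        float acc = 0.0f;
        for (size_t j = 0; j < n; j++) {
            acc += A[i * n + j] * x[j];   /* one load of A per multiply-add */
        }
        y[i] = acc;
    }
}
```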
