Abstract

In this article, a 1.25-V, 8-Gb, 16-Gb/s/pin GDDR6-based accelerator-in-memory (AiM) is presented. A dedicated command (CMD) set for deep learning (DL) is introduced to minimize latency when switching operation modes, and a bank-wide mantissa shift (BWMS) scheme is adopted to reduce calculation delay, current consumption, and circuit area during multiply-accumulate (MAC) operations. By storing a lookup table (LUT) in reserved word lines of the dynamic random access memory (DRAM) bank cell array, various activation functions (AFs) can be supported, such as the Gaussian error linear unit (GELU), sigmoid, and tanh, as well as the rectified linear unit (ReLU) and Leaky ReLU. Performance was evaluated by measuring the fabricated chip on automated test equipment (ATE) and in a custom field-programmable gate array (FPGA)-based system. In the ATE-level evaluation, the chip operates at 16 Gb/s/pin at supply voltages as low as 1.10 V. When evaluated with general matrix-vector multiplication (GEMV) and MNIST workloads on the FPGA-based system, performance gains of 7.5–10.5 times over HBM2-based and GDDR6-based systems were confirmed.
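As a conceptual illustration of the LUT-based activation path described in the abstract, the following Python sketch emulates in software the idea of applying a precomputed GELU table after a GEMV-style MAC stage. The table size, input range, and nearest-entry lookup used here are assumptions for illustration only; they do not reflect the chip's actual word-line LUT organization, precision, or the BWMS MAC hardware.

import numpy as np

def build_gelu_lut(num_entries=256, x_min=-8.0, x_max=8.0):
    # Precompute a GELU table over a fixed input range (tanh approximation of GELU).
    xs = np.linspace(x_min, x_max, num_entries)
    table = 0.5 * xs * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (xs + 0.044715 * xs**3)))
    return xs, table

def lut_activation(x, xs, table):
    # Nearest-entry lookup with clamping; the actual device reads the LUT
    # from reserved DRAM word lines rather than an in-memory array.
    x = np.clip(x, xs[0], xs[-1])
    idx = np.round((x - xs[0]) / (xs[1] - xs[0])).astype(int)
    return table[idx]

# GEMV followed by the LUT-based activation, mimicking the MAC-then-AF data flow.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 64)).astype(np.float32)   # weight matrix
v = rng.standard_normal(64).astype(np.float32)         # input vector
xs, table = build_gelu_lut()
y = lut_activation(W @ v, xs, table)
print(y[:4])

Because the activation is realized as a table read, supporting a different AF (e.g., sigmoid instead of GELU) only requires loading a different table, which mirrors the flexibility the abstract attributes to storing the LUT in reserved word lines.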
