Abstract

The computational load of accurate AI workloads is moving from large server clusters to edge devices, enabling richer and more personalized AI applications. Compute-in-memory (CIM) is beneficial for edge-AI workloads, particularly MAC-intensive ones. However, realizing better power-performance-area (PPA) together with high accuracy remains a major challenge for practical CIM implementations. Recent work examined tradeoffs between MAC throughput, energy efficiency, and accuracy for analog-based CIM [1-3]. Digital CIMs (DCIM), which use small, distributed SRAM banks and customized MAC units, have on the other hand demonstrated massively parallel computation with no accuracy loss and better PPA with technology scaling [4]. In this paper, we introduce a 4-nm SRAM-based DCIM macro that handles variable 8/12b-integer weights and 8/12/16b-integer inputs in a single macro. The proposed 8-transistor 2b OAI (or-and-invert) cell achieves an 11% smaller combined bitcell-and-multiplier area and supports ultra-low-voltage operation down to 0.32V. Furthermore, a sign-extended carry-look-ahead adder (signed-CLA) and an adder-tree pipeline are introduced to boost throughput.

Figure 7.4.1 shows the bitcell structure and a neural-network accuracy comparison across bit precisions. Since we target concurrent write and MAC operations (ping-pong between weight updates and MAC operations), the array needs an even number of rows. A classical approach uses two 12T bitcells and a 2-input NOR: the 12T cell supports simultaneous read and write operations because its read and write ports are independent, and the 2-input NOR performs bitwise multiplication of input activations (XIN) with weights (W). In the proposed SRAM-based DCIM macro, two 8T cells and an OAI are used instead. In this topology, the 8T bitcells provide data storage and row selection for the write operation, while the OAI performs row selection and bitwise multiplication for the MAC operation (a behavioral sketch of this cell is given at the end of this section). The signals for row selection and multiplication are generated by logic in the read-WL driver (RWLDRV) and propagated to the OAIs as RWLBs. Compared to two 12T cells and a NOR2, the area required for data storage and bitwise multiplication is 11% smaller. In addition, three vertical signal tracks are saved, because the OAI logic removes the two RWLs and XINB.

We also compared network accuracy across integer precisions (INT16, INT12, and INT8) for different workloads: MobileNet_v2 and ResNet-50, pre-trained on ImageNet. Quantization and de-quantization are applied before and after each convolution and fully-connected (CONV/FC) layer, so the MAC inside the DCIM is performed in integer format. In our evaluation, the accuracy difference between a MAC using FP16 and a MAC using INT16 is around 0.2%. INT12 results in a negligible accuracy loss, <0.1%, compared to INT16. INT8 yields a 1-to-2% accuracy loss but achieves higher energy efficiency: relative to INT16, energy efficiency improves 1.8× with INT12 and 4× with INT8.
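To make the quantize/integer-MAC/de-quantize flow concrete, below is a minimal NumPy sketch of one CONV/FC computation. It assumes a simple symmetric linear quantizer; the paper does not specify its exact quantization scheme, and the helper names (`quantize`, `int_mac_layer`) are illustrative only.

```python
import numpy as np

def quantize(x, n_bits):
    """Symmetric linear quantization to signed n-bit integers.
    Hypothetical helper; the exact scheme in the paper may differ."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = max(np.max(np.abs(x)) / qmax, 1e-12)
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int64)
    return q, scale

def int_mac_layer(x, w, n_bits):
    """Quantize activations and weights, accumulate in integer
    (as the DCIM would), then de-quantize the result to float."""
    qx, sx = quantize(x, n_bits)
    qw, sw = quantize(w, n_bits)
    acc = qx @ qw          # integer MAC, as performed inside the macro
    return acc * (sx * sw)  # de-quantization back to float

# Compare INT16/INT12/INT8 against the float reference
rng = np.random.default_rng(0)
x, w = rng.standard_normal((1, 256)), rng.standard_normal((256, 64))
ref = x @ w
for bits in (16, 12, 8):
    err = np.abs(int_mac_layer(x, w, bits) - ref).max()
    print(f"INT{bits}: max abs error = {err:.5f}")
```

Running this reproduces the qualitative trend described above: quantization error grows only slightly from INT16 to INT12, then visibly at INT8.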
To support this tradeoff between higher accuracy and higher energy efficiency, we implement flexible bit-width support in our DCIM macro design.
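For reference, here is a behavioral sketch of the proposed 2b OAI cell described above. It assumes an OAI22-style gate and that the RWLDRV drives the selected row's RWLB with the inverted input activation while holding the deselected row's RWLB high; these polarities and the exact gate topology are our assumptions, not taken from the paper.

```python
def oai22(a, b, c, d):
    """OAI22 gate: OR-AND-INVERT, out = NOT((a OR b) AND (c OR d))."""
    return int(not ((a or b) and (c or d)))

def rwldrv(xin, sel0, sel1):
    """Assumed read-WL driver logic: the selected row's RWLB carries the
    inverted input activation; the deselected row's RWLB stays high."""
    return int(not (sel0 and xin)), int(not (sel1 and xin))

def cell_2b(w0, w1, xin, sel0, sel1):
    """2b OAI cell: two 8T bitcells store w0/w1; their complement (WB)
    nodes feed the OAI, which multiplies the selected bit with XIN."""
    rwlb0, rwlb1 = rwldrv(xin, sel0, sel1)
    return oai22(rwlb0, 1 - w0, rwlb1, 1 - w1)

# Exhaustive check: the cell output equals (selected weight) AND XIN
for w0 in (0, 1):
    for w1 in (0, 1):
        for xin in (0, 1):
            assert cell_2b(w0, w1, xin, sel0=1, sel1=0) == (w0 & xin)
            assert cell_2b(w0, w1, xin, sel0=0, sel1=1) == (w1 & xin)
print("2b OAI cell matches bitwise multiply for all input combinations")
```

This also illustrates why the OAI saves vertical tracks: multiplication and row selection are folded into the two RWLB signals, so no separate XIN/XINB lines need to cross the array.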
