Abstract

The computational load of accurate AI workloads is moving from large server clusters to edge devices, enabling richer and more personalized AI applications. Compute-in-memory (CIM) is beneficial for edge-AI workloads, particularly MAC-intensive ones. However, realizing better power-performance-area (PPA) together with high accuracy remains a major challenge for practical CIM implementations. Recent work examined tradeoffs between MAC throughput, energy efficiency, and accuracy for analog-based CIM [1-3]. Digital CIMs (DCIM), which use small, distributed SRAM banks and customized MAC units, have on the other hand demonstrated massively parallel computation with no accuracy loss and better PPA with technology scaling [4]. In this paper, we introduce a 4-nm SRAM-based DCIM macro that handles variable 8/12b-integer weights and 8/12/16b-integer inputs in a single macro. The proposed 8-transistor 2b OAI (or-and-invert) cell achieves an 11% smaller combined bitcell-and-multiplier area and supports ultra-low-voltage operation down to 0.32V. Furthermore, a sign-extended carry-look-ahead adder (signed-CLA) and an adder-tree pipeline are introduced to boost throughput.

Figure 7.4.1 shows the bitcell structure and a neural-network accuracy comparison across bit precisions. Since we target concurrent write and MAC operations (ping-pong between weight updates and MAC operations), the array needs an even number of rows. A classical approach uses two 12T bitcells and a 2-input NOR: the 12T cell supports simultaneous read and write operations because its read and write ports are independent, and the 2-input NOR performs bitwise multiplication of input activations (XIN) with weights (W). In the proposed SRAM-based DCIM macro, two 8T cells and an OAI are used instead. In this topology, the 8T bitcells provide data storage and row selection for the write operation, while the OAI performs row selection and bitwise multiplication for the MAC operation (a behavioral sketch of this cell is given at the end of this section). The signals for row selection and multiplication are generated by logic in the read-WL driver (RWLDRV) and propagated to the OAIs as RWLBs. Compared to two 12T cells and a NOR2, the area required for data storage and bitwise multiplication is 11% smaller. In addition, three vertical signal tracks are saved, because the OAI logic removes the two RWLs and XINB.

We also compared network accuracy across integer precisions (INT16, INT12, and INT8) for different workloads: MobileNet_v2 and ResNet-50, pre-trained on ImageNet. Quantization and de-quantization are applied before and after each convolution and fully-connected (CONV/FC) layer, so the MAC inside the DCIM is performed in integer format. In our evaluation, the accuracy difference between a MAC using FP16 and a MAC using INT16 is around 0.2%. INT12 results in a negligible accuracy loss, <0.1%, compared to INT16. INT8 yields a 1-to-2% accuracy loss but achieves higher energy efficiency: relative to INT16, energy efficiency improves 1.8× with INT12 and 4× with INT8.
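To make the quantize/integer-MAC/de-quantize flow concrete, below is a minimal NumPy sketch of one CONV/FC computation. It assumes a simple symmetric linear quantizer; the paper does not specify its exact quantization scheme, and the helper names (`quantize`, `int_mac_layer`) are illustrative only.

```python
import numpy as np

def quantize(x, n_bits):
    """Symmetric linear quantization to signed n-bit integers.
    Hypothetical helper; the exact scheme in the paper may differ."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = max(np.max(np.abs(x)) / qmax, 1e-12)
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int64)
    return q, scale

def int_mac_layer(x, w, n_bits):
    """Quantize activations and weights, accumulate in integer
    (as the DCIM would), then de-quantize the result to float."""
    qx, sx = quantize(x, n_bits)
    qw, sw = quantize(w, n_bits)
    acc = qx @ qw          # integer MAC, as performed inside the macro
    return acc * (sx * sw)  # de-quantization back to float

# Compare INT16/INT12/INT8 against the float reference
rng = np.random.default_rng(0)
x, w = rng.standard_normal((1, 256)), rng.standard_normal((256, 64))
ref = x @ w
for bits in (16, 12, 8):
    err = np.abs(int_mac_layer(x, w, bits) - ref).max()
    print(f"INT{bits}: max abs error = {err:.5f}")
```

Running this reproduces the qualitative trend described above: quantization error grows only slightly from INT16 to INT12, then visibly at INT8.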
To support this tradeoff between higher accuracy and higher energy efficiency, we implement flexible bit-width support in our DCIM macro design.
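For reference, here is a behavioral sketch of the proposed 2b OAI cell described above. It assumes an OAI22-style gate and that the RWLDRV drives the selected row's RWLB with the inverted input activation while holding the deselected row's RWLB high; these polarities and the exact gate topology are our assumptions, not taken from the paper.

```python
def oai22(a, b, c, d):
    """OAI22 gate: OR-AND-INVERT, out = NOT((a OR b) AND (c OR d))."""
    return int(not ((a or b) and (c or d)))

def rwldrv(xin, sel0, sel1):
    """Assumed read-WL driver logic: the selected row's RWLB carries the
    inverted input activation; the deselected row's RWLB stays high."""
    return int(not (sel0 and xin)), int(not (sel1 and xin))

def cell_2b(w0, w1, xin, sel0, sel1):
    """2b OAI cell: two 8T bitcells store w0/w1; their complement (WB)
    nodes feed the OAI, which multiplies the selected bit with XIN."""
    rwlb0, rwlb1 = rwldrv(xin, sel0, sel1)
    return oai22(rwlb0, 1 - w0, rwlb1, 1 - w1)

# Exhaustive check: the cell output equals (selected weight) AND XIN
for w0 in (0, 1):
    for w1 in (0, 1):
        for xin in (0, 1):
            assert cell_2b(w0, w1, xin, sel0=1, sel1=0) == (w0 & xin)
            assert cell_2b(w0, w1, xin, sel0=0, sel1=1) == (w1 & xin)
print("2b OAI cell matches bitwise multiply for all input combinations")
```

This also illustrates why the OAI saves vertical tracks: multiplication and row selection are folded into the two RWLB signals, so no separate XIN/XINB lines need to cross the array.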
