Abstract

Many AI edge devices require local intelligence to achieve fast computing time (t_AC), high energy efficiency (EF), and privacy. The transfer-learning approach is a popular solution for AI edge chips, wherein the AI model is trained in the cloud and a few of its neural layers are fine-tuned (re-trained) in the edge device. This enables the dynamic incorporation of data from in-situ environments or private information. Computing-in-memory (CIM) is a promising approach to improving EF for AI edge chips; however, existing CIM schemes support only inference [1]-[5] with forward (FWD) propagation. They do not support training, which requires both FWD and backward (BWD) propagation, owing to the differences in weight-access flow between FWD and BWD propagation. As Fig. 15.2.1 shows, efforts to increase the precision of the input (IN), weight (W), and/or output (OUT) tend to degrade t_AC and EF for training operations irrespective of scheme: digital FWD and BWD (DF-DB) or CIM-FWD-digital-BWD (CIMF-DB). This work develops a two-way-transpose (TWT) SRAM-CIM macro supporting multibit MAC operations for FWD and BWD propagation with fast t_AC and high EF within a compact area. The proposed scheme features (1) a TWT multiply cell (TWT-MC) with high resistance to process variation, and (2) a small-offset gain-enhancement sense amplifier (SOGE-SA) to tolerate a small read margin. A 28nm 64Kb TWT SRAM-CIM macro was fabricated using a foundry-provided compact 6T-SRAM cell, making it the first SRAM-CIM device to support both inference and training operations. This macro also demonstrates the fastest t_AC (3.8-21ns) and highest EF (7-61.1 TOPS/W) for MAC operations using 2-8b inputs, 4-8b weights, and 12-20b outputs.
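
To illustrate why FWD and BWD propagation need different weight-access flows through the same stored weights, the following minimal Python/NumPy sketch contrasts the two MAC patterns. It is a functional model only, not a description of the macro's circuitry; the 64x64 array size and 4b operand width are illustrative assumptions chosen from within the 2-8b input / 4-8b weight range quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Example values: 4b signed weights and activations (assumed widths, within
# the 2-8b IN / 4-8b W range cited in the abstract).
W = rng.integers(-8, 8, size=(64, 64))   # weight array stored in the SRAM-CIM macro
x = rng.integers(-8, 8, size=64)         # input activations (IN)
dy = rng.integers(-8, 8, size=64)        # output gradients arriving during training

# FWD propagation (inference flow): MAC along rows of W.
#   OUT[i] = sum_j W[i, j] * IN[j]
y = W @ x

# BWD propagation (training flow): the same weights accessed column-wise,
# i.e. a transposed MAC, which a two-way-transpose (TWT) cell provides
# without keeping a second copy of W.
#   dIN[j] = sum_i W[i, j] * dOUT[i]
dx = W.T @ dy

print(y[:4], dx[:4])
```
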
