Training a CNN involves computationally intensive optimization algorithms that fit the network to a training dataset, updating the network weights for subsequent inference and pattern classification. The application of in-memory computation would therefore enable a highly power-efficient, low-latency on-the-edge CNN training technique by avoiding the memory wall created by external memory read/write operations (for off-chip instruction and data transfer). A memory write-verify-and-reprogram technique can control RRAM variability; however, verification and reprogramming are complex processes that require additional resources to implement the verification circuit in practice. In this study, we demonstrate a practical First-In Max-Out (FIMO)-based cache memory called the Maximum Count Binary Comparator (MCBC) layer, using 1T3R, 1T5R, and 1T7R RRAM structures in a probability-based accuracy-improvement architecture, without the conventional verification process. We constructed a 10-layer modified MobileNet with filter sizes ranging from 32 to 512 and trained it on the Traffic Sign Recognition Database (TSRD) using a three-tier abstraction simulation learning framework: (1) a high-level 10-layer CNN implementation in Python + TensorFlow; (2) Verilog HDL-based FP32MUL and FP32ADD (32-bit floating-point multiplier and adder) circuits constructed from RRAM NAND gates using 1T2R structures; and (3) a digital look-up-table (LUT) model for RRAM variability.
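The third tier, a digital LUT model for RRAM variability, can be illustrated with a minimal sketch: each programmed resistance state maps to a table of sampled resistances, and a read draws one entry at random before digitization. The table values, threshold, and function names below are illustrative placeholders, not the paper's measured device data.

```python
import random

# Hypothetical digital LUT for RRAM variability: each programmed state
# maps to a list of resistance samples (ohms); a read draws one at random.
# Values are illustrative, not measured TiN/HfO2/Hf/TiN data.
RRAM_LUT = {
    "LRS": [9e3, 1.0e4, 1.1e4, 1.2e4, 8e3],   # low-resistance state
    "HRS": [7e5, 9e5, 1.0e6, 1.1e6, 1.3e6],   # high-resistance state
}

def read_cell(state, rng=random):
    """Return one resistance sample for the programmed state."""
    return rng.choice(RRAM_LUT[state])

def read_bit(state, threshold=1e5):
    """Digitize a read: below threshold (LRS) -> 1, above (HRS) -> 0."""
    return 1 if read_cell(state) < threshold else 0
```

In this sketch the LUT replaces a physics-based device model: variability is captured purely by the spread of the tabulated samples, which keeps the CNN-level simulation fast.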
An edge learning framework (for the forward pass) is demonstrated using digital RRAM NAND/NOR universal gates integrated with the Maximum Count Binary Comparator (MCBC) layer, both to partially circumvent the impact of RRAM variability and to quantify its effect on CNN training prediction accuracy for 65 nm CMOS OxRAM (TiN/HfO<sub xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">2</sub>/Hf/TiN) devices with current compliances of 5, 10, and 50 μA for low-power IoT applications. The MCBC layer was simulated with a SPICE model, for which the estimated chip layout is 1150 × 1230 nm<sup xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">2</sup> per logical gate input; repeating the NOR-gate logical operations for 1, 3, 5, and 7 cycles improved overall prediction accuracy from 10% to 60%, respectively.
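The accuracy gain from repeating NOR operations can be understood as a maximum-count (majority) vote over several noisy reads. A minimal sketch, assuming each single NOR evaluation flips its output with an illustrative error probability (the function names and the 20% error rate are assumptions, not the paper's device statistics):

```python
import random
from collections import Counter

def noisy_nor(a, b, p_err, rng):
    """One NOR-gate evaluation whose output flips with probability p_err,
    a stand-in for RRAM cycle-to-cycle variability."""
    out = int(not (a or b))
    return out ^ (1 if rng.random() < p_err else 0)

def mcbc_nor(a, b, cycles, p_err, rng):
    """Maximum-count vote: repeat the operation for an odd number of
    cycles and output the most frequent result."""
    reads = [noisy_nor(a, b, p_err, rng) for _ in range(cycles)]
    return Counter(reads).most_common(1)[0][0]

def accuracy(cycles, p_err=0.2, trials=20000, seed=0):
    """Fraction of random input pairs for which the voted output
    matches the ideal NOR truth table."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        a, b = rng.randint(0, 1), rng.randint(0, 1)
        if mcbc_nor(a, b, cycles, p_err, rng) == int(not (a or b)):
            correct += 1
    return correct / trials
```

Under this toy error model, `accuracy(c)` rises monotonically over the cycle counts {1, 3, 5, 7}, mirroring the probability-based improvement the MCBC layer exploits without a write-verify step.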