Abstract

High-precision computation with low latency and high energy efficiency is required by AI-driven applications and scientific computing. Emerging compute-in-memory (CIM) technology shows great potential for accelerating the multiply-accumulate (MAC) operations that are executed frequently in these workloads. Resistive RAM (RRAM) is well suited to CIM owing to features such as nonvolatility, small cell size and a MAC-friendly structure. However, existing RRAM CIM designs focus on accelerating fixed-point/integer operations. Several works adopt a logic-CIM structure to support high-precision floating-point (FP) calculations, but they require many cycles and a large area to perform a single FP operation. To meet the low-latency and high-energy-efficiency demands of widely used FP calculations, we propose an accelerated FP-MAC architecture based on a 40nm RRAM CIM array. A fully parallel data input scheme and a triangular weight arrangement are proposed for low-latency multi-bit multiplication. A non-uniformly grouped sense amplifier (NUGSA) array is adopted to save energy and area. Experiments show that the proposed FP-MAC design achieves an energy efficiency of up to 8.8 TFLOPS/W in FP8 mode and 3.3 TFLOPS/W in bFP16 mode, with a computing latency of 3.34ns.
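To illustrate how an array that natively performs integer MAC can serve FP operands, the Python sketch below uses a common exponent pre-alignment approach: each operand's mantissa is quantized to a signed integer, every product is shifted to the largest product exponent, and only an integer accumulation remains. This is an assumption for illustration, not the mapping described in this work. The function name `fp_mac_aligned` and the `mant_bits` parameter are hypothetical, and the sketch does not model the fully parallel input scheme, the triangular weight arrangement or the NUGSA array.

```python
import math


def fp_mac_aligned(xs, ws, mant_bits=8):
    """Hypothetical sketch: approximate sum(x * w) with exponent pre-alignment
    followed by a purely integer mantissa accumulation."""
    prods = []
    for x, w in zip(xs, ws):
        # Decompose each operand as m * 2**e with 0.5 <= |m| < 1 (m == 0 for zero).
        mx, ex = math.frexp(x)
        mw, ew = math.frexp(w)
        # Quantize mantissas to signed integers with mant_bits fractional bits.
        ix = int(round(mx * (1 << mant_bits)))
        iw = int(round(mw * (1 << mant_bits)))
        p = ix * iw
        if p != 0:  # zero products contribute nothing and must not set e_max
            prods.append((p, ex + ew))
    if not prods:
        return 0.0
    # Align every integer product to the largest product exponent.
    e_max = max(e for _, e in prods)
    acc = 0
    for p, e in prods:
        acc += p >> (e_max - e)  # arithmetic right shift (floors toward -inf)
    # Each product carries 2 * mant_bits fractional bits; restore the scale.
    return math.ldexp(acc, e_max - 2 * mant_bits)


print(fp_mac_aligned([1.0, 2.0, 0.5], [3.0, 0.25, 4.0]))  # 5.5
```

The right shift discards low-order bits of products with smaller exponents, which is the accuracy cost of pre-alignment; a larger `mant_bits` reduces that error at the price of wider integer accumulation.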
