Large Language Models (LLMs), based on the transformer architecture, have demonstrated remarkable capabilities in natural language processing tasks, enabling machines to generate human-like text and engage in meaningful dialogues. However, the exponential increase in model parameters has led to limitations in inference speed and energy efficiency. Compute-in-memory (CIM) technology offers a promising solution for accelerating AI inference by performing analog computations directly within memory, potentially reducing latency and power consumption. While CIM has been successfully applied to accelerate Convolutional Neural Networks (CNNs), the matrix–matrix multiplication (MatMul) operations inherent in the scaled dot-product attention of the transformer present unique challenges for direct CIM implementation. In this work, we propose InMemQK, a compute-in-memory-based attention accelerator that focuses on optimizing MatMul operations through software-hardware co-design. At the software level, InMemQK employs product quantization (PQ) to eliminate data dependencies. At the hardware level, InMemQK integrates energy-efficient time-domain MAC macros for ADC-free computation. Experimental results show that InMemQK achieves 13.2×–13.9× lower power consumption than existing CIM-based accelerators.
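To make the PQ idea concrete, the following is a minimal NumPy sketch of how product quantization can replace the Q·Kᵀ MatMul in attention with codebook lookups: key subvectors are encoded as centroid indices, and each query then sums precomputed query-centroid dot products instead of performing full-length MACs. The subvector count, codebook size, and k-means training loop here are illustrative assumptions for exposition, not InMemQK's actual configuration or hardware mapping.

```python
# Sketch: approximating the attention Q @ K.T MatMul with product quantization (PQ).
# All hyperparameters (n_sub, n_cent, iters) are illustrative assumptions.
import numpy as np

def train_codebooks(K, n_sub, n_cent, iters=20, seed=0):
    """Learn one k-means codebook per subspace from the key matrix K."""
    rng = np.random.default_rng(seed)
    n, d = K.shape
    d_sub = d // n_sub
    codebooks = []
    for s in range(n_sub):
        X = K[:, s * d_sub:(s + 1) * d_sub]                  # keys restricted to subspace s
        C = X[rng.choice(n, n_cent, replace=False)].copy()   # init centroids from samples
        for _ in range(iters):
            # assign each key subvector to its nearest centroid, then update centroids
            assign = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
            for c in range(n_cent):
                if np.any(assign == c):
                    C[c] = X[assign == c].mean(axis=0)
        codebooks.append(C)
    return codebooks

def encode_keys(K, codebooks):
    """Replace every key subvector by the index of its nearest centroid."""
    n_sub = len(codebooks)
    d_sub = K.shape[1] // n_sub
    codes = np.empty((K.shape[0], n_sub), dtype=np.int32)
    for s, C in enumerate(codebooks):
        X = K[:, s * d_sub:(s + 1) * d_sub]
        codes[:, s] = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
    return codes

def pq_scores(Q, codes, codebooks):
    """Approximate Q @ K.T: build per-subspace lookup tables of query-centroid
    dot products, then accumulate table gathers instead of elementwise MACs."""
    n_sub = len(codebooks)
    d_sub = Q.shape[1] // n_sub
    scores = np.zeros((Q.shape[0], codes.shape[0]))
    for s, C in enumerate(codebooks):
        lut = Q[:, s * d_sub:(s + 1) * d_sub] @ C.T          # (n_queries, n_cent) table
        scores += lut[:, codes[:, s]]                        # gather per key, sum over subspaces
    return scores

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    Q, K = rng.standard_normal((8, 64)), rng.standard_normal((128, 64))
    cbs = train_codebooks(K, n_sub=8, n_cent=16)
    approx = pq_scores(Q, encode_keys(K, cbs), cbs)
    print("mean |error| vs exact Q @ K.T:", np.abs(approx - Q @ K.T).mean())
```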