Abstract

With the advancement of deep neural networks (DNNs), DNN-driven applications have spread from the cloud to the edge. However, the intensive computation and data movement in DNNs impede their adoption on resource-constrained edge devices. Quantization, a common model compression method, has attracted much attention because it enables efficient inference by lowering the bit-width of network parameters. With its dense storage and computing-in-memory crossbar arrays, resistive memory (RRAM) enables energy-efficient, small-footprint processing-in-memory (PIM) acceleration of DNNs at the edge. However, when a network is deployed onto RRAM-based PIM, the mismatch between the structure of a neural network layer and the memory array leaves a tremendous number of cells unused, resulting in resource under-utilization and low computational efficiency. In this work, we observe that prior quantization approaches fail to improve hardware resource utilization because they ignore the structural information of the RRAM hardware. Combining neural network model information with hardware information is therefore essential for a high-utilization RRAM-based PIM design. Considering the vast number of model parameters and the heterogeneous RRAM crossbar structure, we develop RaQu, a novel quantization framework that leverages AutoML to automatically generate, for any model, a fine-grained quantization strategy that fully utilizes the resources of RRAM-based PIM. Experimental results show that RaQu achieves up to 29.2%–37.4% improvement in resource utilization and 1.8%–3.3% improvement in model accuracy compared to prior coarse-grained quantization methods.
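To make the under-utilization problem concrete, the following minimal Python sketch estimates cell utilization when a convolutional layer is naively tiled onto fixed-size crossbars. It is not from the paper: the mapping convention (one crossbar row per weight-vector element, bit-sliced columns), the 128x128 crossbar size, and the 2-bit cell capacity are all illustrative assumptions.

    import math

    def crossbar_utilization(rows_needed, cols_needed, xbar_rows=128, xbar_cols=128):
        # Fraction of RRAM cells actually holding weights when a layer's
        # weight matrix is tiled onto fixed-size crossbars without packing.
        row_tiles = math.ceil(rows_needed / xbar_rows)
        col_tiles = math.ceil(cols_needed / xbar_cols)
        used = rows_needed * cols_needed
        allocated = row_tiles * col_tiles * xbar_rows * xbar_cols
        return used / allocated

    # Hypothetical first conv layer: 3 input channels, 64 output channels,
    # 3x3 kernels, 8-bit weights on 2-bit cells (4 columns per weight).
    in_ch, out_ch, k, bitwidth, cell_bits = 3, 64, 3, 8, 2
    rows = in_ch * k * k                               # one row per weight-vector element
    cols = out_ch * math.ceil(bitwidth / cell_bits)    # bit-sliced columns
    print(f"utilization: {crossbar_utilization(rows, cols):.1%}")  # -> 21.1%

Under these assumptions, the 27 occupied rows strand the remaining 101 rows of each allocated crossbar. Because the bit-width determines how many bit-sliced columns each weight occupies, a hardware-aware, per-layer choice of bit-width, as RaQu automates, can reclaim cells that a uniform, coarse-grained bit-width leaves unused.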
