High-Performance Method and Architecture for Attention Computation in DNN Inference.

Qi Cheng, Xiaofang Hu, He Xiao, Yue Zhou, Shukai Duan

doi:10.1109/tbcas.2024.3436837

Abstract

In recent years, the combination of the Attention mechanism and deep learning has found a wide range of applications in medical imaging. However, because Attention involves complex computation, existing hardware architectures suffer from high resource consumption or low accuracy, and deploying Attention efficiently on DNN accelerators remains a challenge. This paper proposes an online-programmable Attention hardware architecture based on a compute-in-memory (CIM) macro, which reduces the complexity of Attention in hardware and improves integration density, energy efficiency, and calculation accuracy. First, the Attention computation is decomposed into multiple cascaded combinatorial matrix operations to reduce the complexity of its hardware implementation. Second, to reduce the influence of non-ideal hardware characteristics, an online-programmable CIM architecture is designed that improves calculation accuracy by dynamically adjusting the weights. Finally, SPICE simulation verifies that the proposed Attention hardware architecture can be applied to the inference of deep neural networks. Based on a 100 nm CMOS process, and compared with traditional Attention hardware architectures, integration density and energy efficiency are increased by at least 91.38 times, and latency and computing efficiency are improved by about 12.5 times.
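To make the idea of a cascade of matrix operations concrete, the following is a minimal software sketch of standard scaled dot-product attention, written as four chained stages. It is an illustration only: the abstract does not specify the paper's actual decomposition or how it maps onto the CIM macro, and the helper names (`matmul`, `transpose`, `softmax_rows`, `attention`) and the toy inputs are assumptions introduced here.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def softmax_rows(m):
    """Row-wise softmax; the row maximum is subtracted for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention(q, k, v):
    """Standard scaled dot-product attention as a cascade of stages.

    Hypothetical illustration, not the paper's hardware decomposition.
    """
    d_k = len(k[0])
    scores = matmul(q, transpose(k))                                # stage 1: Q K^T
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]  # stage 2: scale
    weights = softmax_rows(scaled)                                  # stage 3: softmax
    return matmul(weights, v)                                       # stage 4: weights V

# Toy inputs (assumed values, chosen only for illustration).
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = attention(q, k, v)
```

Each stage consumes only the output of the previous one, which is the property that makes a pipeline of matrix units plausible in hardware. The softmax stage is the one nonlinear step, so any real decomposition has to treat it separately from the matrix products.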

More From: IEEE transactions on biomedical circuits and systems

Similar Papers

Throughput Maximization of Delay-Aware DNN Inference in Edge Computing by Exploring DNN Model Partitioning and Inference Parallelism
Jing Li ... Weifa Liang
IEEE Transactions on Mobile Computing | VOL. 22
01 May 2023

Delay-Aware DNN Inference Throughput Maximization in Edge Computing via Jointly Exploring Partitioning and Parallelism
Jing Li ... Weifa Liang
04 Oct 2021

IGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud
Fei Xu ... Ruitao Shang
IEEE Transactions on Parallel and Distributed Systems | VOL. 34
01 Mar 2023

A 95.6-TOPS/W Deep Learning Inference Accelerator With Per-Vector Scaled 4-bit Quantization in 5 nm
Ben Keller ... Stephen G Tell
IEEE Journal of Solid-State Circuits | VOL. 58
01 Apr 2023
