SWattention: designing fast and memory-efficient attention for a new Sunway Supercomputer

Ruohan Wu, Xianyu Zhu, Junshi Chen, Sha Liu, Tianyu Zheng, Xin Liu, Hong An

doi:10.1007/s11227-024-05890-8

Abstract

In the past few years, Transformer-based large language models (LLMs) have become the dominant technology in a wide range of applications. To scale up the sequence length of the Transformer, FlashAttention was proposed to compute exact attention with reduced memory requirements and faster execution. However, implementing the FlashAttention algorithm on the new-generation Sunway supercomputer faces many constraints, such as its unique heterogeneous architecture and limited memory bandwidth. This work proposes SWattention, a highly efficient method for computing exact attention on the SW26010pro processor. To fully utilize the 6 core groups (CGs) and the 64 cores per CG on the processor, we design a two-level parallel task partition strategy. Asynchronous memory access is employed so that memory access overlaps with computation. Additionally, a tiling strategy is introduced to determine optimal SRAM block sizes. Compared with standard attention, SWattention achieves around 2.0x speedup for FP32 training and 2.5x speedup for mixed-precision training. The sequence lengths range from 1k to 8k and scale up to 16k without running out of memory. For end-to-end performance, SWattention achieves up to 1.26x speedup when training GPT-style models, which demonstrates that SWattention enables longer sequence lengths for LLM training.
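The abstract's central idea, computing exact attention block by block so that the full score matrix never has to be stored, can be illustrated with a small sketch. The code below is a generic FlashAttention-style online-softmax loop in pure Python, not the SWattention kernel: the function names, the `block_size` parameter, and the choice to process one query row at a time are assumptions made for illustration, and nothing here models the SW26010pro core groups, asynchronous memory access, or SRAM sizing.

```python
import math


def standard_attention(Q, K, V):
    """Reference: softmax(Q K^T / sqrt(d)) V, materializing each full score row."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block_size=2):
    """Same result, but K and V are visited one block at a time (online softmax)."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    dv = len(V[0])
    out = []
    for q in Q:
        m = -math.inf     # running maximum of the scores seen so far
        l = 0.0           # running softmax denominator
        acc = [0.0] * dv  # running unnormalized output row
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
            m_new = max(m, max(scores))
            # Rescale the partial sums accumulated under the old maximum.
            correction = math.exp(m - m_new)
            weights = [math.exp(s - m_new) for s in scores]
            l = l * correction + sum(weights)
            acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_blk))
                   for j, a in enumerate(acc)]
            m = m_new
        out.append([a / l for a in acc])
    return out
```

Only one block of keys and values plus three running quantities are live at any time, which is why the block size can be chosen to fit a small on-chip memory. A production kernel would also tile the queries and operate on matrices rather than Python lists.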

Journal: The Journal of Supercomputing
Publication Date: Mar 11, 2024
License type: CC BY 4.0

