Abstract

Recent machine translation models are mostly based on attention-based neural machine translation (NMT), and many well-known models such as the Transformer and bidirectional encoder representations from Transformers (BERT) have been proposed. Along with these algorithmic advancements, hardware acceleration methods for attention-based NMT models have also been introduced. However, the parameter size of attention-based NMT models keeps growing to guarantee satisfactory translation quality. Among the various weights, the linearization weights ($W^{Q}$, $W^{K}$, $W^{V}$, and $W^{O}$) account for a non-negligible portion (up to 30%) of the entire parameters in modern NMT models. In this paper, we propose a linearization weight compression method and a near-memory hardware decoder for fast, in-situ weight decompression. Our compression method exploits fixed-point quantization along with Huffman coding, which is selectively applied depending on the weight value distribution. Our hardware decoder decompresses the Huffman-coded weights near memory to minimize the weight decoding latency. Our compression method achieves compression ratios of 4.9–10.0 with small NMT score drops across the five widely used attention-based NMT models (Transformer, Transformer-XL-base, Transformer-XL-large, BERT-base, and BERT-large). In addition, owing to the reduced linearization weight size, our proposed method with near-memory decoding reduces the multi-head attention (MHA) execution latency by 11.8% on average compared to the baseline when weight loading and initialization are taken into account. In terms of memory data transfer energy consumption, our proposed method achieves a memory energy saving of 16.1% on average compared to the baseline.
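To make the compression idea concrete, the sketch below illustrates one plausible reading of the described scheme: a linearization weight matrix is first quantized to fixed-point integers, and Huffman coding is then applied only when the quantized-value distribution is skewed enough for it to beat plain fixed-point storage. This is an illustrative sketch only, not the paper's exact pipeline; the bit width, the selection criterion, and the toy matrix shape are assumptions.

```python
# Illustrative sketch (assumptions: 8-bit fixed point, Huffman coding applied
# only if it reduces the total bit count, toy 512x512 weight matrix).
import heapq
from collections import Counter

import numpy as np


def quantize_fixed_point(weights: np.ndarray, bits: int = 8):
    """Uniformly quantize float weights to signed fixed-point integers."""
    scale = np.max(np.abs(weights)) / (2 ** (bits - 1) - 1)
    q = np.round(weights / scale).astype(np.int32)
    return q, scale


def huffman_code_lengths(symbol_counts: Counter) -> dict:
    """Return the Huffman code length (in bits) for each quantized symbol."""
    # Heap entries: (count, tie_breaker, {symbol: code_length_so_far}).
    heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(symbol_counts.items())]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: only one distinct value
        return {s: 1 for s in symbol_counts}
    tie = len(heap)
    while len(heap) > 1:
        c1, _, l1 = heapq.heappop(heap)
        c2, _, l2 = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**l1, **l2}.items()}
        heapq.heappush(heap, (c1 + c2, tie, merged))
        tie += 1
    return heap[0][2]


def compress_weight_matrix(weights: np.ndarray, bits: int = 8):
    """Quantize, then Huffman-code only if it actually shrinks the storage."""
    q, scale = quantize_fixed_point(weights, bits)
    counts = Counter(q.ravel().tolist())
    lengths = huffman_code_lengths(counts)
    huffman_bits = sum(counts[s] * lengths[s] for s in counts)
    plain_bits = q.size * bits
    use_huffman = huffman_bits < plain_bits  # selective application
    stored_bits = huffman_bits if use_huffman else plain_bits
    return {
        "scale": scale,
        "use_huffman": use_huffman,
        "compression_ratio": (weights.size * 32) / stored_bits,  # vs. FP32
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_q = rng.normal(0.0, 0.02, size=(512, 512))  # toy W^Q-like matrix
    print(compress_weight_matrix(w_q, bits=8))
```

Because a trained weight matrix is typically bell-shaped around zero, the quantized symbols are highly non-uniform and Huffman coding usually wins over plain fixed-point storage in this sketch; for a flatter distribution the selective check would fall back to fixed-point only, which is the behavior the abstract attributes to the proposed method.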
