Abstract

Relying entirely on an attention mechanism, the Transformer introduced by Vaswani et al. (2017) achieves state-of-the-art results for machine translation. In contrast to recurrent and convolutional neural networks, it does not explicitly model relative or absolute position information in its structure. Instead, it requires adding representations of absolute positions to its inputs. In this work we present an alternative approach, extending the self-attention mechanism to efficiently consider representations of the relative positions, or distances between sequence elements. On the WMT 2014 English-to-German and English-to-French translation tasks, this approach yields improvements of 1.3 BLEU and 0.3 BLEU over absolute position representations, respectively. Notably, we observe that combining relative and absolute position representations yields no further improvement in translation quality. We describe an efficient implementation of our method and cast it as an instance of relation-aware self-attention mechanisms that can generalize to arbitrary graph-labeled inputs.

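To make the approach described in the abstract concrete, below is a minimal, single-head NumPy sketch of relation-aware self-attention with relative position representations. The function and variable names (relative_self_attention, aK, aV, the clipping distance k) and the unbatched, single-head setup are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relative_self_attention(x, Wq, Wk, Wv, aK, aV, k):
    """x: (n, d_model); Wq/Wk/Wv: (d_model, d_z);
    aK, aV: (2k + 1, d_z) learned embeddings for relative
    positions clipped to [-k, k]. Illustrative sketch only."""
    n, _ = x.shape
    d_z = Wq.shape[1]
    q, kx, v = x @ Wq, x @ Wk, x @ Wv

    # Relative position j - i for every (i, j) pair, clipped to [-k, k]
    # and shifted into [0, 2k] to index the embedding tables.
    rel = np.clip(np.arange(n)[None, :] - np.arange(n)[:, None], -k, k) + k
    rK, rV = aK[rel], aV[rel]                      # (n, n, d_z)

    # Compatibility scores: q_i . (k_j + a^K_ij) / sqrt(d_z)
    scores = (q @ kx.T + np.einsum('id,ijd->ij', q, rK)) / np.sqrt(d_z)
    alpha = softmax(scores, axis=-1)

    # Outputs: z_i = sum_j alpha_ij (v_j + a^V_ij)
    return alpha @ v + np.einsum('ij,ijd->id', alpha, rV)
```
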
Highlights

  • The Transformer (Vaswani et al., 2017) employs an encoder-decoder structure, consisting of stacked encoder and decoder layers

  • In this work we present an efficient way of incorporating relative position representations in the self-attention mechanism of the Transformer

  • Regarding absolute position representations, the authors hypothesized that sinusoidal position encodings would help the model generalize to sequence lengths unseen during training by allowing it to learn to attend by relative position

Summary

Transformer

The Transformer (Vaswani et al., 2017) employs an encoder-decoder structure, consisting of stacked encoder and decoder layers. Encoder layers consist of two sublayers: self-attention followed by a position-wise feed-forward layer. Decoder layers consist of three sublayers: self-attention followed by encoder-decoder attention, followed by a position-wise feed-forward layer. The model uses residual connections around each of the sublayers, followed by layer normalization (Ba et al., 2016). For absolute position representations, the authors hypothesized that sinusoidal position encodings would help the model generalize to sequence lengths unseen during training by allowing it to learn to attend by relative position. This property is shared by our relative position representations which, in contrast to absolute position representations, are invariant to the total sequence length.
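
As a point of reference for the sinusoidal absolute position encodings mentioned above (Vaswani et al., 2017), here is a brief NumPy sketch; the function name and the assumption of an even d_model are illustrative.

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    # Each position is mapped to interleaved sines and cosines at
    # geometrically spaced frequencies; assumes d_model is even.
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model // 2)
    angles = pos / np.power(10000.0, i / d_model)
    enc = np.zeros((max_len, d_model))
    enc[:, 0::2] = np.sin(angles)                  # even dimensions
    enc[:, 1::2] = np.cos(angles)                  # odd dimensions
    return enc  # added to the token embeddings before the first layer
```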

Self-Attention
Relation-aware Self-Attention
Relative Position Representations
Efficient Implementation
Experimental Setup
Machine Translation
Model Variations
Findings
Conclusions