Abstract

Transformer has achieved great success in NLP and serves as the backbone of advanced models such as BERT and GPT. However, Transformer and its existing variants may not be optimal at capturing token distances, because the position or distance embeddings they use usually cannot preserve the precise information of real distances, which is important for modeling the order and relations of contexts. In this paper, we propose DA-Transformer, a distance-aware Transformer that exploits real token distances. We incorporate the real distances between tokens to re-scale the raw self-attention weights, which are computed from the relevance between the attention query and key. Concretely, in different self-attention heads the relative distance between each pair of tokens is weighted by different learnable parameters, which control the preferences of these heads for long- or short-term information. Since the raw weighted real distances may not be optimal for adjusting self-attention weights, we propose a learnable sigmoid function to map them into re-scaled coefficients with proper ranges. We first clip the raw self-attention weights via the ReLU function to keep non-negativity and introduce sparsity, and then multiply them by the re-scaled coefficients to encode real distance information into self-attention. Extensive experiments on five benchmark datasets show that DA-Transformer can effectively improve the performance of many tasks and outperforms the vanilla Transformer and several of its variants.
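
To make the adjustment pipeline concrete, below is a minimal PyTorch sketch of distance-aware self-attention following the description above: per-head learnable distance weights, a sigmoid-style mapping into bounded coefficients, ReLU-clipped raw attention, and element-wise re-scaling. The parameter names (`w`, `v`), the exact sigmoid parameterization, and the row-sum normalization are illustrative assumptions, not the paper's verbatim formulation.

```python
# Minimal sketch of distance-aware self-attention as described in the abstract.
# The names w and v, the sigmoid form, and the row-sum normalization are
# illustrative assumptions, not the paper's exact formulation.
import torch
import torch.nn as nn


class DistanceAwareSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # One learnable distance weight per head: controls whether the head
        # prefers short-range (negative w) or long-range (positive w) context.
        self.w = nn.Parameter(torch.zeros(n_heads))
        # Learnable parameter of the sigmoid-style mapping (assumed form).
        self.v = nn.Parameter(torch.zeros(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, n, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, n, self.n_heads, self.d_head).transpose(1, 2)

        # Raw query-key relevance, clipped by ReLU to keep non-negativity
        # and introduce sparsity.
        raw = torch.relu(q @ k.transpose(-2, -1) / self.d_head ** 0.5)  # (b, h, n, n)

        # Real relative distances |i - j| between token pairs, weighted per head.
        pos = torch.arange(n, device=x.device, dtype=x.dtype)
        dist = (pos[None, :] - pos[:, None]).abs()                      # (n, n)
        weighted = self.w.view(1, -1, 1, 1) * dist                      # (1, h, n, n)

        # Sigmoid-style mapping into bounded re-scaling coefficients
        # (placeholder form: (1 + e^v) * sigmoid(weighted - v)).
        coeff = (1 + self.v.exp()).view(1, -1, 1, 1) * torch.sigmoid(
            weighted - self.v.view(1, -1, 1, 1)
        )

        # Encode distance information by re-scaling the clipped attention,
        # then row-normalize (assumed normalization choice).
        scores = raw * coeff
        attn = scores / (scores.sum(dim=-1, keepdim=True) + 1e-6)
        out = attn @ v                                                  # (b, h, n, d_head)
        return self.out(out.transpose(1, 2).reshape(b, n, -1))
```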

Highlights

  • Position embeddings may not be optimal for distance modeling in Transformer because they cannot preserve the precise information of real token distances

  • Transformer has been successfully applied to many NLP tasks, such as … (et al., 2019), machine translation (Vaswani et al., 2017), and reading comprehension (Xu et al., 2019)

  • We propose a learnable sigmoid function to map the weighted distances into re-scaled coefficients with proper ranges for better adjusting the attention weights (see the sketch below)
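
As a small illustration of this mapping step, the snippet below uses one plausible learnable-sigmoid form, (1 + e^v) / (1 + e^(v − x)); the paper's exact parameterization may differ. Positive per-head distance weights make the coefficient grow with distance (long-range heads), while negative weights make it decay (short-range heads).

```python
# Sketch of a learnable sigmoid that maps weighted token distances into
# bounded re-scaling coefficients. The form below is an assumption for
# illustration: v controls the upper bound, x is the weighted distance.
import numpy as np

def learnable_sigmoid(x: np.ndarray, v: float) -> np.ndarray:
    """Map weighted distances x into coefficients in (0, 1 + e^v)."""
    return (1.0 + np.exp(v)) / (1.0 + np.exp(v - x))

distances = np.arange(0, 6)          # real token distances |i - j|
for w in (1.0, -1.0):                # per-head distance weight
    coeff = learnable_sigmoid(w * distances, v=0.0)
    print(f"w={w:+.1f}:", np.round(coeff, 3))
# w=+1.0 grows with distance (long-range head); w=-1.0 decays (short-range head).
```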

Summary

Transformer

Position embeddings may not be optimal for distance modeling in Transformer because they usually cannot keep the precise information of real distances. Shaw et al. (2018) proposed to add the embeddings of relative positions to the attention key and value to capture the relative distance between two tokens; they kept the precise distance only within a certain range by clipping the maximum distance with a threshold, which helps generalize to long sequences. Yan et al. (2019) proposed direction-aware sinusoidal relative position embeddings and used them in a similar way to Transformer-XL; they also used un-scaled attention to better fit the NER task. Different from these methods, we propose to directly re-scale the attention weights based on the mapped relative distances instead of using sinusoidal position embeddings, which explicitly encodes real distance information and achieves more accurate distance modeling.
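
To make this contrast concrete, the short sketch below compares the clipped relative-position indices used for embedding lookup (in the style of Shaw et al., 2018) with the real pairwise distances that are fed directly into the re-scaling. The clip threshold k = 2 is an arbitrary value chosen for illustration.

```python
# Clipped relative-position indices (embedding lookup, Shaw et al., 2018)
# versus the real pairwise distances used directly for re-scaling.
import numpy as np

n, k = 6, 2
pos = np.arange(n)
rel = pos[None, :] - pos[:, None]          # real relative distance j - i

clipped_idx = np.clip(rel, -k, k) + k      # indices into 2k+1 relative embeddings
print(clipped_idx)   # distances beyond +-2 collapse to the same index

print(np.abs(rel))   # real distances keep the exact gap between every pair
```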

DA-Transformer
Attention Adjustment
Computational Complexity Analysis
Experiments
Performance Evaluation
Methods
Influence of Different Mapping Functions
Influence of Different Attention Adjusting Methods
Findings
Conclusion