Abstract

Self-attention networks (SANs) with a selective mechanism have produced substantial improvements in various NLP tasks by concentrating on a subset of input words. However, the underlying reasons for their strong performance have not been well explained. In this paper, we bridge this gap by assessing the strengths of selective SANs (SSANs), which are implemented with a flexible and universal Gumbel-Softmax. Experimental results on several representative NLP tasks, including natural language inference, semantic role labeling, and machine translation, show that SSANs consistently outperform the standard SANs. Through well-designed probing experiments, we empirically validate that the improvement of SSANs can be attributed in part to mitigating two commonly cited weaknesses of SANs: word order encoding and structure modeling. Specifically, the selective mechanism improves SANs by paying more attention to content words that contribute to the meaning of the sentence. The code and data are released at https://github.com/xwgeng/SSAN.
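
The abstract describes the selective mechanism only at a high level. The sketch below shows one plausible way to gate self-attention with a Gumbel-Softmax in PyTorch; it is a minimal illustration, not the released SSAN implementation. The module name `SelectiveSelfAttention`, the extra `select_proj` projection, and the single-head layout are assumptions made for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelectiveSelfAttention(nn.Module):
    """Single-head self-attention with a Gumbel-Softmax selection gate (illustrative sketch)."""

    def __init__(self, d_model: int, tau: float = 1.0):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Extra projection that scores each query-key pair for selection (assumed design choice).
        self.select_proj = nn.Linear(d_model, d_model)
        self.tau = tau
        self.scale = d_model ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)

        # Standard scaled dot-product attention energies.
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale

        # Two logits per query-key pair: "select" vs. "skip".
        sel = torch.matmul(self.select_proj(x), k.transpose(-2, -1)) * self.scale
        sel_logits = torch.stack([sel, torch.zeros_like(sel)], dim=-1)

        # Gumbel-Softmax with the straight-through trick: a hard 0/1 mask in the
        # forward pass, while remaining differentiable w.r.t. the selection logits.
        gate = F.gumbel_softmax(sel_logits, tau=self.tau, hard=True)[..., 0]

        # Attend only over the selected positions and renormalize.
        attn = torch.softmax(scores, dim=-1) * gate
        attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-9)
        return torch.matmul(attn, v)


# Example usage on random inputs.
layer = SelectiveSelfAttention(d_model=64)
out = layer(torch.randn(2, 10, 64))  # -> (2, 10, 64)
```

The Gumbel-Softmax makes the discrete "select a subset of words" decision trainable end-to-end, which is why the paper describes it as a flexible and universal way to implement selection.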

Highlights

  • Self-attention networks (SANs) (Lin et al., 2017) have achieved promising progress in various natural language processing (NLP) tasks, including machine translation (Vaswani et al., 2017), natural language inference (Shen et al., 2018b), semantic role labeling (Tan et al., 2018; Strubell et al., 2018), and language representation (Devlin et al., 2019).

  • Although SANs have demonstrated their effectiveness on various NLP tasks, recent studies have empirically revealed that SANs suffer from two representation limitations: word order encoding (Yang et al., 2019a) and syntactic structure modeling (Tang et al., 2018).

  • We attribute the improvement to the strengths of selective SANs (SSANs) in word order encoding and structure modeling, which are empirically validated in Sections 4 and 5.

Summary

Introduction

Self-attention networks (SANs) (Lin et al., 2017) have achieved promising progress in various natural language processing (NLP) tasks, including machine translation (Vaswani et al., 2017), natural language inference (Shen et al., 2018b), semantic role labeling (Tan et al., 2018; Strubell et al., 2018), and language representation (Devlin et al., 2019). There has been growing interest in integrating a selective mechanism into SANs, which has produced substantial improvements in a variety of NLP tasks. Some researchers incorporated a hard constraint into SANs to select a subset of input words, on top of which self-attention is conducted (Shen et al., 2018c; Hou et al., 2019; Yang et al., 2019b); for example, Shen et al. (2018c) used reinforced sampling to dynamically choose a subset of input elements, which are then fed to SANs. Others proposed a soft mechanism: Yang et al. (2018) and Guo et al. (2019) imposed a learned Gaussian bias over the original attention distribution to enhance its ability to capture local contexts, as sketched below.
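
The following is a rough sketch of the "soft" variant mentioned above, in the spirit of the learned Gaussian bias of Yang et al. (2018): each query predicts a centre position and window size, and the resulting Gaussian term is added to the attention energies before the softmax. The helper `gaussian_biased_attention` and the two projections passed to it are hypothetical names, not code from the cited papers.

```python
import torch
import torch.nn as nn


def gaussian_biased_attention(q, k, v, pos_proj, win_proj):
    """Scaled dot-product attention with a learned Gaussian locality bias (illustrative sketch).

    q, k, v: (batch, seq_len, d_model); pos_proj / win_proj: nn.Linear(d_model, 1).
    """
    batch, seq_len, d_model = q.shape
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_model ** 0.5

    # Each query predicts a centre position P_i and a window size D_i within the sentence.
    center = torch.sigmoid(pos_proj(q)).squeeze(-1) * (seq_len - 1)          # (batch, seq_len)
    window = torch.sigmoid(win_proj(q)).squeeze(-1) * (seq_len - 1) + 1e-3   # avoid zero width

    # Gaussian bias G_ij = -(j - P_i)^2 / (2 * (D_i / 2)^2), added to the energies before softmax.
    positions = torch.arange(seq_len, device=q.device, dtype=q.dtype)
    dist = positions.view(1, 1, seq_len) - center.unsqueeze(-1)              # (batch, seq_len, seq_len)
    sigma = (window / 2).unsqueeze(-1)
    bias = -dist.pow(2) / (2 * sigma.pow(2))

    attn = torch.softmax(scores + bias, dim=-1)
    return torch.matmul(attn, v)


# Example usage with shared q/k/v and two small projection layers.
x = torch.randn(2, 10, 64)
out = gaussian_biased_attention(x, x, x, nn.Linear(64, 1), nn.Linear(64, 1))  # -> (2, 10, 64)
```

Unlike the hard Gumbel-Softmax gate, this bias never zeroes out positions entirely; it simply concentrates attention mass around the predicted window, which is why it is described as a soft selective mechanism.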

