Abstract

Pre-trained Transformer-based neural language models, such as BERT, have achieved remarkable results on a wide variety of NLP tasks. Recent work has shown that attention-based models can benefit from more focused attention over local regions. Most existing approaches either restrict the attention scope to a linear span or are confined to specific tasks such as machine translation and question answering. In this paper, we propose a syntax-aware local attention, in which the attention scope is restricted based on distances in the syntactic structure. The proposed syntax-aware local attention can be integrated with pre-trained language models, such as BERT, so that the model focuses on syntactically relevant words. We conduct experiments on various single-sentence benchmarks, including sentence classification and sequence labeling tasks. Experimental results show consistent gains over BERT on all benchmark datasets. Extensive studies verify that our model achieves better performance owing to its more focused attention over syntactically relevant words.
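
The abstract does not reproduce the formulation. As a rough sketch, restricting the attention scope by syntactic distance can be expressed as an additive mask on the standard dot-product attention; the mask symbol M, the tree distance dist(i, j), and the threshold D below are our notational assumptions, not the paper's:

```latex
% Sketch only: M, dist(i, j), and D are assumed notation, not taken from the paper.
\[
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right)V,
\qquad
M_{ij} =
\begin{cases}
0 & \text{if } \operatorname{dist}(i, j) \le D, \\
-\infty & \text{otherwise.}
\end{cases}
\]
```

Positions farther than D hops away in the syntactic structure receive a score of negative infinity and therefore zero attention weight after the softmax.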

Highlights

  • Transformer (Vaswani et al., 2017) has performed remarkably well, building on multi-headed dot-product attention, which fully takes global contextualized information into account (a reference sketch of this attention follows this list)

  • Several studies find that self-attention can be enhanced by local attention, where the attention scopes are restricted to important local regions

  • We propose a syntax-aware local attention (SLA) that is adaptable to several tasks, and integrate it with BERT (Devlin et al., 2019)
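
For reference, here is a minimal PyTorch sketch of the scaled dot-product attention underlying the Transformer (a textbook rendering, not the authors' code); the optional additive mask argument is where a local attention scheme such as SLA would plug in:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Standard scaled dot-product attention (Vaswani et al., 2017).

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    mask:    optional additive mask broadcastable to (batch, heads, seq_len, seq_len),
             with 0 for allowed positions and -inf for blocked ones.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (batch, heads, seq_len, seq_len)
    if mask is not None:
        scores = scores + mask                      # blocked positions become -inf
    weights = F.softmax(scores, dim=-1)             # attention over all (global) positions
    return weights @ v                              # weighted sum of values
```

With mask=None this reduces to the fully global attention of Vaswani et al. (2017); passing an additive 0/-inf mask restricts each query to a subset of positions.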

Summary

Introduction

Transformer (Vaswani et al., 2017) has performed remarkably well, building on multi-headed dot-product attention, which fully takes global contextualized information into account. We propose a syntax-aware local attention (SLA) that is adaptable to several tasks, and integrate it with BERT (Devlin et al., 2019). We first apply dependency parsing to the input text and compute the syntactic distances between input words to construct the self-attention masks. The local attention scores are obtained by applying these masks to the dot-product attention. We then combine the syntax-aware local attention with the Transformer's global attention. We find that the syntax-aware local attention contributes more to the aggregation of local and global attention. Attention visualization confirms that the syntactic information helps capture important local regions. This paper makes the following contributions: i) SLA can capture information from important local regions based on the syntactic structure.
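
The summary above does not give the concrete mask construction, so the following is a minimal sketch under our own assumptions: distances are hop counts in the dependency tree, heads is the head-index array produced by a dependency parser (with the root pointing to itself), and max_dist is a hypothetical threshold. The resulting additive mask would be fed to the dot-product attention as in the earlier sketch, and the masked (local) output aggregated with the unmasked (global) output, for example by a learned weighted combination (again an assumption, not the paper's stated scheme).

```python
import torch

def dependency_distances(heads):
    """Pairwise hop distances between tokens in a dependency tree.

    heads: list where heads[i] is the index of token i's head
           (the root points to itself) -- a common parser output format.
    """
    n = len(heads)
    adj = [[] for _ in range(n)]
    for i, h in enumerate(heads):
        if h != i:                      # skip the root's self-loop
            adj[i].append(h)
            adj[h].append(i)
    dist = torch.full((n, n), float("inf"))
    for s in range(n):                  # BFS from every token over the undirected tree
        dist[s, s] = 0.0
        queue = [s]
        while queue:
            u = queue.pop(0)
            for w in adj[u]:
                if dist[s, w] == float("inf"):
                    dist[s, w] = dist[s, u] + 1
                    queue.append(w)
    return dist

def syntax_local_mask(heads, max_dist=2):
    """Additive attention mask: 0 within max_dist hops in the tree, -inf otherwise."""
    dist = dependency_distances(heads)
    zero = torch.zeros_like(dist)
    neg_inf = torch.full_like(dist, float("-inf"))
    return torch.where(dist <= max_dist, zero, neg_inf)

# Example: "The cat sat" with heads [1, 2, 2] (token 2, "sat", is the root).
# With max_dist=1, "The" and "sat" (tree distance 2) cannot attend to each other.
mask = syntax_local_mask([1, 2, 2], max_dist=1)
```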

  • Transformer Attention
  • Local Attentions
  • Approach
  • Syntax-aware Local Attention
  • Attention Aggregation
  • Experimental Setup
  • Main Results
  • Conclusion
  • Training Procedure
  • Implementation Details
  • Testing on Chinese Benchmarks
