Learning Adaptive Axis Attentions in Fine-tuning: Beyond Fixed Sparse Attention Patterns

Ani Nenkova, Zihan Wang, Jingbo Shang, Handong Zhao, Jason Kuen, Jiuxiang Gu, Ruiyi Zhang, Tong Sun, Vlad I Morariu

doi:10.48448/qtj4-pw30

Publication Date: May 7, 2022

Abstract

We present a comprehensive study of sparse attention patterns in Transformer models. We first question the need for pre-training with sparse attention and present experiments showing that an efficient fine-tuning only approach yields a slightly worse but still competitive model. Then we compare the widely used local attention pattern and the less-well-studied global attention pattern, demonstrating that global patterns have several unique advantages. We also demonstrate that a flexible approach to attention, with different patterns across different layers of the model, is beneficial for some tasks. Drawing on this insight, we propose a novel Adaptive Axis Attention method, which learns—during fine-tuning—different attention patterns for each Transformer layer depending on the downstream task. Rather than choosing a fixed attention pattern, the adaptive axis attention method identifies important tokens—for each task and model layer—and focuses attention on those. It does not require pre-training to accommodate the sparse patterns and demonstrates competitive and sometimes better performance against fixed sparse attention patterns that require resource-intensive pre-training.
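
To make the contrast between local and global sparse attention concrete, here is a minimal pure-Python sketch. It is not the paper's implementation: the function names (`local_mask`, `global_mask`, `masked_attention`) are hypothetical, and the global tokens are chosen by hand, whereas the abstract describes learning the important tokens per task and per layer during fine-tuning. The sketch only shows what the two fixed patterns look like as attention masks and how a mask restricts softmax attention.

```python
import math

def local_mask(n, window):
    """Local pattern: token i may attend to token j only if |i - j| <= window."""
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]

def global_mask(n, global_tokens):
    """Global pattern (assumed form): the chosen tokens attend to every token
    and every token attends to them; each token also attends to itself."""
    g = set(global_tokens)
    return [[i == j or i in g or j in g for j in range(n)] for i in range(n)]

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to positions where mask is True."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        allowed = [j for j in range(len(k)) if mask[i][j]]
        scores = [sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d)
                  for j in allowed]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[j][c] for e, j in zip(exps, allowed))
                    for c in range(len(v[0]))])
    return out

# Toy input: three tokens with two-dimensional embeddings.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = masked_attention(x, x, x, global_mask(3, [0]))  # token 0 acts as the global token
```

In the attention matrix, each global token fills one full row and one full column, while the local pattern fills a band around the diagonal. Under the assumptions above, an adaptive variant would replace the hand-picked `global_tokens` list with a selection learned separately for each layer.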
