Abstract

Multi-head attention, a collection of several attention mechanisms that independently attend to different parts of the input, is the key ingredient in the Transformer. Recent work has shown, however, that a large proportion of the heads in a Transformer’s multi-head attention mechanism can be safely pruned away without significantly harming the performance of the model; such pruning leads to models that are noticeably smaller and faster in practice. Our work introduces a new head pruning technique that we term differentiable subset pruning. Intuitively, our method learns per-head importance variables and then enforces a user-specified hard constraint on the number of unpruned heads. The importance variables are learned via stochastic gradient descent. We conduct experiments on natural language inference and machine translation; we show that differentiable subset pruning performs comparably or better than previous works while offering precise control of the sparsity level.
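
As a rough, illustrative sketch of this idea (not the authors' implementation), the snippet below attaches one learnable importance score to each attention head and enforces the hard constraint by keeping only the k highest-scoring heads. The module name GatedHeads, the multiplicative gating of head outputs, and the straight-through trick used to keep the scores trainable by stochastic gradient descent are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class GatedHeads(nn.Module):
    """Illustrative wrapper: one learnable importance score per head,
    with a hard top-k constraint on how many heads stay unpruned."""

    def __init__(self, num_heads: int):
        super().__init__()
        # Per-head importance variables, trained by SGD with the model.
        self.importance = nn.Parameter(torch.zeros(num_heads))

    def gates(self, k: int) -> torch.Tensor:
        # Hard constraint: exactly k heads receive a gate of 1.
        keep = torch.topk(self.importance, k).indices
        hard = torch.zeros_like(self.importance)
        hard[keep] = 1.0
        # Straight-through trick: the forward pass uses the hard gates,
        # the backward pass sends gradients through the soft scores.
        soft = torch.sigmoid(self.importance)
        return hard + soft - soft.detach()

    def forward(self, head_outputs: torch.Tensor, k: int) -> torch.Tensor:
        # head_outputs: (batch, num_heads, seq_len, head_dim)
        return head_outputs * self.gates(k).view(1, -1, 1, 1)

# Example: keep 4 of 8 heads; the 4 pruned heads' outputs are zeroed.
layer = GatedHeads(num_heads=8)
x = torch.randn(2, 8, 10, 64)
y = layer(x, k=4)
```

The straight-through gate is just one standard way to combine a hard top-k constraint with gradient-based learning of the scores; the paper's actual relaxation may differ.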

Highlights

  • The Transformer (Vaswani et al., 2017) has become one of the most popular neural architectures used in NLP

  • We conduct experiments on natural language inference and machine translation; we show that differentiable subset pruning performs comparably or better than previous works while offering precise control of the sparsity level

  • In layer 1 ≤ l ≤ L, Hl different attention mechanisms are applied in parallel; importantly, it is this parallelism that has led to the rise of the Transformer: it is a more efficient architecture in practice, so it can be trained on more data

Summary

Introduction

The Transformer (Vaswani et al., 2017) has become one of the most popular neural architectures used in NLP. The key ingredient in the Transformer architecture is the multi-head attention mechanism, which is an assembly of multiple attention functions (Bahdanau et al., 2015) applied in parallel. In layer 1 ≤ l ≤ L, Hl different attention mechanisms are applied in parallel; importantly, it is this parallelism that has led to the rise of the Transformer: it is a more efficient architecture in practice, so it can be trained on more data. Each individual attention mechanism is referred to as a head; multi-head attention is the simultaneous application of multiple attention heads in a single architecture. Given the head importance scores ιh, suppose we would like to sample a subset J of size 1 according to the distribution p(J = {h}) = exp(ιh) / Σh′ exp(ιh′), i.e., a softmax over the importance scores.
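
To make the size-1 case concrete, the distribution above is simply a softmax over the head importance scores. The sketch below (our illustration, not code from the paper) draws a head index from that distribution using the Gumbel-max trick, a standard reparameterization of categorical sampling; the function name and the use of the Gumbel trick here are assumptions for the example.

```python
import torch

def sample_size_one_subset(importance: torch.Tensor) -> int:
    """Draw a single head h with probability softmax(importance)[h],
    via the Gumbel-max trick (argmax of scores plus Gumbel noise)."""
    gumbel = -torch.log(-torch.log(torch.rand_like(importance)))
    return int(torch.argmax(importance + gumbel))

# Example: 8 heads with arbitrary importance scores.
iota = torch.randn(8)
probs = torch.softmax(iota, dim=0)   # p(J = {h}) for every head h
h = sample_size_one_subset(iota)
print(h, probs[h].item())
```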
