Abstract

Multi-head attention, a powerful strategy in the Transformer, is assumed to utilize information from diverse representation subspaces. However, measuring the diversity between heads' representations, or exploiting that diversity, has rarely been studied. In this paper, we quantitatively analyze the inter-head diversity of multi-head attention by applying recently developed measures of similarity between two deep representations: Singular Vector Canonical Correlation Analysis (SVCCA) and Centered Kernel Alignment (CKA). In doing so, we empirically show that multi-head attention does diversify the representation subspaces of its heads as the number of heads increases. Based on this analysis, we hypothesize that there exists an optimal inter-head diversity at which a model achieves better performance. To examine this hypothesis, we closely inspect three techniques for controlling inter-head diversity: (1) a Hilbert-Schmidt Independence Criterion (HSIC) regularizer among representation subspaces, (2) an orthogonality regularizer, and (3) Drophead, which randomly zeroes out each head in every training step. In experiments on various machine translation and language modeling tasks, we show that controlling inter-head diversity leads to the best performance among the baselines.
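As a concrete reference for the similarity measure named above, the sketch below computes linear CKA between the representations produced by two attention heads. This is the standard linear-kernel variant of CKA (Kornblith et al.); the paper may use a different kernel or combine it with SVCCA preprocessing, and the function name `linear_cka` and the example head matrices are illustrative assumptions only.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices of shape (n_samples, dim).

    A minimal sketch of the linear-kernel CKA measure; the paper's exact
    setup (kernel choice, preprocessing) may differ.
    """
    # Center each feature dimension.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)

# Hypothetical usage: compare two heads' outputs collected over 1000 tokens,
# each with a per-head dimension of 64.
head_i = np.random.randn(1000, 64)
head_j = np.random.randn(1000, 64)
print(linear_cka(head_i, head_j))  # near 0 for independent random features
```

A low CKA value between two heads indicates dissimilar (diverse) representation subspaces, which is the quantity the analysis tracks as the number of heads grows.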

Highlights

  • Since multi-head attention was introduced by Vaswani et al. [1], it has become a standard component across various Natural Language Processing (NLP) tasks

  • By analyzing the diversity of representation subspaces, we show how Singular Vector Canonical Correlation Analysis (SVCCA) and Centered Kernel Alignment (CKA) reflect the dynamics of inter-head similarity with respect to the number of heads

  • We provide empirical evidence that multi-head attention diversifies its representations as the number of heads increases

Summary

Introduction

Since multi-head attention was introduced by Vaswani et al. [1], it has become a standard component across various Natural Language Processing (NLP) tasks. Voita et al. [6] analyzed, using layer-wise relevance propagation, that individual heads are sensitive to different linguistic features. While such studies imply that there is diversity among the representation subspaces of the multiple heads, their analyses focus mainly on linguistic diversity. We instead adopt Singular Vector Canonical Correlation Analysis (SVCCA) [8] and Centered Kernel Alignment (CKA) [9], as they are recently developed tools for measuring the similarity of two deep representations. Applying these similarity measures, we empirically show that the diversity of multi-head representations does increase as the number of heads increases, which is solid evidence supporting the statement of Vaswani et al. [1] that the multi-head strategy utilizes diverse representation subspaces. The models trained with our methods achieve higher performance than their baselines in all experiments.
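To make one of the diversity-control techniques concrete, here is a minimal sketch of the Drophead idea described in the abstract: whole attention heads are zeroed out at random during training. The tensor layout `(batch, n_heads, seq_len, d_head)`, the drop probability `p`, and the inverted-dropout rescaling are illustrative assumptions; the paper's exact masking scheme may differ.

```python
import numpy as np

def drophead(head_outputs, p=0.2, training=True, rng=None):
    """Randomly zero out entire attention heads during training.

    head_outputs: array of shape (batch, n_heads, seq_len, d_head).
    A sketch of the Drophead technique named in the abstract, not the
    paper's exact formulation.
    """
    if not training or p == 0.0:
        return head_outputs
    rng = rng or np.random.default_rng()
    batch, n_heads = head_outputs.shape[:2]
    # Keep each head with probability (1 - p), independently per example.
    keep = rng.random((batch, n_heads, 1, 1)) >= p
    # Rescale surviving heads so the expected magnitude is unchanged
    # (inverted-dropout convention; an assumption, not stated in the source).
    return head_outputs * keep / (1.0 - p)
```

Randomly removing heads discourages any single head from dominating and implicitly pushes the heads toward more diverse, less redundant representations.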

  • Related Works
  • Multi-Head Attention
  • Methods for Controlling Inter-Head Diversity
      • HSIC Regularizer
      • Orthogonality Regularizer
      • Drophead
  • Inter-Head Similarity Analysis
      • Experimental Details for Similarity Analysis
      • Applying SVCCA and CKA
      • Analysis on Inter-Model Similarity
      • Does Multi-Head Strategy Diversify a Model’s Representation Subspaces?
  • Experiments on Controlling Inter-Head Similarity Methods
      • Experimental Details
      • Analysis on Controlling Inter-Head Diversity
      • Quantitative Evaluation
  • Conclusions
