Abstract

Multi-head attention, a powerful strategy in the Transformer, is assumed to utilize information from diverse representation subspaces. However, measuring the diversity between heads' representations, or exploiting that diversity, has rarely been studied. In this paper, we quantitatively analyze the inter-head diversity of multi-head attention by applying recently developed measures of similarity between two deep representations: Singular Vector Canonical Correlation Analysis (SVCCA) and Centered Kernel Alignment (CKA). In doing so, we empirically show that multi-head attention does diversify the representation subspaces of its heads as the number of heads increases. Based on this analysis, we hypothesize that there exists an optimal inter-head diversity at which a model achieves better performance. To examine this hypothesis, we closely inspect three techniques for controlling inter-head diversity: (1) a Hilbert-Schmidt Independence Criterion (HSIC) regularizer among representation subspaces, (2) an orthogonality regularizer, and (3) Drophead, which randomly zeroes out each head in every training step. In experiments on various machine translation and language modeling tasks, we show that controlling inter-head diversity leads to the best performance among the baselines.
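As a concrete reference for the similarity measure named above, the sketch below computes linear CKA between the representations produced by two attention heads. This is the standard linear-kernel variant of CKA (Kornblith et al.); the paper may use a different kernel or combine it with SVCCA preprocessing, and the function name `linear_cka` and the example head matrices are illustrative assumptions only.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices of shape (n_samples, dim).

    A minimal sketch of the linear-kernel CKA measure; the paper's exact
    setup (kernel choice, preprocessing) may differ.
    """
    # Center each feature dimension.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, ord="fro")
    norm_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return cross / (norm_x * norm_y)

# Hypothetical usage: compare two heads' outputs collected over 1000 tokens,
# each with a per-head dimension of 64.
head_i = np.random.randn(1000, 64)
head_j = np.random.randn(1000, 64)
print(linear_cka(head_i, head_j))  # near 0 for independent random features
```

A low CKA value between two heads indicates dissimilar (diverse) representation subspaces, which is the quantity the analysis tracks as the number of heads grows.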

Highlights

  • Since multi-head attention was introduced by Vaswani et al. [1], it has become a standard component across various Natural Language Processing (NLP) tasks

  • By analyzing the diversity of representation subspaces, we show how Singular Vector Canonical Correlation Analysis (SVCCA) and Centered Kernel Alignment (CKA) reflect the dynamics of inter-head similarity with respect to the number of heads

  • We provide empirical evidence that multi-head attention diversifies its representations as the number of heads increases

Summary

Introduction

Since multi-head attention was introduced by Vaswani et al. [1], it has become a standard component across various Natural Language Processing (NLP) tasks. Voita et al. [6] analyzed, using layer-wise relevance propagation, that individual heads are sensitive to different linguistic features. While such studies imply that there is diversity among the representation subspaces of the multiple heads, their analyses focus mainly on linguistic diversity. We instead adopt Singular Vector Canonical Correlation Analysis (SVCCA) [8] and Centered Kernel Alignment (CKA) [9], as they are recently developed tools for measuring the similarity of two deep representations. Applying these similarity measures, we empirically show that the diversity of multi-head representations does increase as the number of heads increases, which is solid evidence supporting the statement of Vaswani et al. [1] that the multi-head strategy utilizes diverse representation subspaces. The models trained with our methods achieve higher performance than their baselines in all experiments.
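To make one of the diversity-control techniques concrete, here is a minimal sketch of the Drophead idea described in the abstract: whole attention heads are zeroed out at random during training. The tensor layout `(batch, n_heads, seq_len, d_head)`, the drop probability `p`, and the inverted-dropout rescaling are illustrative assumptions; the paper's exact masking scheme may differ.

```python
import numpy as np

def drophead(head_outputs, p=0.2, training=True, rng=None):
    """Randomly zero out entire attention heads during training.

    head_outputs: array of shape (batch, n_heads, seq_len, d_head).
    A sketch of the Drophead technique named in the abstract, not the
    paper's exact formulation.
    """
    if not training or p == 0.0:
        return head_outputs
    rng = rng or np.random.default_rng()
    batch, n_heads = head_outputs.shape[:2]
    # Keep each head with probability (1 - p), independently per example.
    keep = rng.random((batch, n_heads, 1, 1)) >= p
    # Rescale surviving heads so the expected magnitude is unchanged
    # (inverted-dropout convention; an assumption, not stated in the source).
    return head_outputs * keep / (1.0 - p)
```

Randomly removing heads discourages any single head from dominating and implicitly pushes the heads toward more diverse, less redundant representations.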

  • Related Works
  • Multi-Head Attention
  • Methods for Controlling Inter-Head Diversity
      • HSIC Regularizer
      • Orthogonality Regularizer
      • Drophead
  • Inter-Head Similarity Analysis
      • Experimental Details for Similarity Analysis
      • Applying SVCCA and CKA
      • Analysis on Inter-Model Similarity
      • Does Multi-Head Strategy Diversify a Model’s Representation Subspaces?
  • Experiments on Controlling Inter-Head Similarity Methods
      • Experimental Details
      • Analysis on Controlling Inter-Head Diversity
      • Quantitative Evaluation
  • Conclusions
