Abstract

Recently, attention mechanisms have boosted the performance of many neural network models in Natural Language Processing (NLP). Among the various attention mechanisms, Multi-Head Attention (MHA) is a powerful and popular variant. MHA, an essential component of the Transformer, lets the model attend to different feature subspaces independently. Despite its success, we conjecture that the heads of existing MHA may not collaborate properly. To validate this assumption and further improve the performance of the Transformer, we study the collaboration problem for MHA in this paper. First, we propose the Single-Layer Collaboration (SLC) mechanism, which helps each attention head refine its attention distribution based on feedback from the other heads. We then extend SLC to the cross-layer Multi-Head Dense Collaboration (MHDC) mechanism, which helps each MHA layer learn its attention distributions using knowledge from the other MHA layers. Both SLC and MHDC are implemented as lightweight modules with very few additional parameters. Equipped with these modules, our new framework, the Collaborative TransFormer (CollFormer), significantly outperforms the vanilla Transformer on a range of NLP tasks, including machine translation, sentence semantic relatedness, natural language inference, sentence classification, and reading comprehension. In addition, we carry out extensive quantitative experiments to analyze the properties of MHDC in different settings. The experimental results validate the effectiveness and universality of both MHDC and CollFormer.
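
The abstract does not give the exact formulation of SLC; the following is a minimal PyTorch-style sketch of one plausible interpretation, in which each head's attention logits are refined by a learned head-to-head mixing matrix before the softmax. The class name, the `collab` parameter, and the mixing scheme are assumptions for illustration, not the paper's actual method.

```python
# Hypothetical sketch of Single-Layer Collaboration (SLC) on top of
# standard multi-head self-attention. The head-to-head mixing matrix
# `collab` is an assumed mechanism; it is initialised to the identity,
# which recovers vanilla MHA.
import torch
import torch.nn as nn


class CollaborativeMHA(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.h = n_heads
        self.d_k = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        # Assumed SLC parameter: how much each head listens to every
        # other head's attention logits.
        self.collab = nn.Parameter(torch.eye(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        # Split into heads: (batch, heads, tokens, d_k)
        q = self.q_proj(x).view(b, t, self.h, self.d_k).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.h, self.d_k).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.h, self.d_k).transpose(1, 2)
        # Scaled dot-product logits: (batch, heads, tokens, tokens)
        logits = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        # SLC step (assumed form): each head's logits become a learned
        # combination of all heads' logits before the softmax.
        logits = torch.einsum("gh,bhqk->bgqk", self.collab, logits)
        attn = logits.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, self.h * self.d_k)
        return self.out_proj(out)
```

Under this reading, the cross-layer MHDC variant would additionally share such collaborative signals across MHA layers; since the abstract gives no further detail, that extension is not sketched here.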
