Recent studies have shown that, due to redundancy, some heads of the Transformer model can be pruned without diminishing the model's performance. In this paper, we propose a constrained optimization algorithm based on Hebbian learning that trains specific layers of the Transformer architecture to enforce diversification among the heads of the multi-head attention module. The diversification is achieved through a single-layer feed-forward neural network that is added to the Transformer architecture and trained with the proposed algorithm. We apply the algorithm to three architectural variations of the baseline Transformer model. Beyond diversifying the heads, the proposed methodology can also be used to prune heads that capture redundant information. Experiments on diverse NLP tasks, including machine translation, text summarization, question answering and large language modeling, show that our approach consistently improves the performance of baseline Transformer models.
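The abstract does not specify the exact form of the constrained Hebbian update or of the added feed-forward layer, so the following is only a minimal illustrative sketch of the general idea: a single linear layer placed over the concatenated head outputs, updated with a Hebbian-style (activation-product) rule under a simple norm constraint, plus a redundancy score that could inform head pruning. All names here (`HeadMixer`, `hebbian_step`, `head_redundancy`) and the decorrelation-style measure are hypothetical stand-ins, not the paper's method.

```python
# Illustrative sketch only: the paper's constrained Hebbian algorithm is not
# given in the abstract; this assumes a generic Hebbian update with row
# normalization (Oja-like) as a surrogate for the constraint.
import torch
import torch.nn as nn


class HeadMixer(nn.Module):
    """Hypothetical single-layer feed-forward mixer over multi-head outputs."""

    def __init__(self, num_heads: int, head_dim: int):
        super().__init__()
        self.mix = nn.Linear(num_heads * head_dim, num_heads * head_dim, bias=False)

    def forward(self, heads: torch.Tensor) -> torch.Tensor:
        # heads: (batch, seq, num_heads, head_dim) -> mix the concatenated heads.
        b, s, h, d = heads.shape
        return self.mix(heads.reshape(b, s, h * d))

    @torch.no_grad()
    def hebbian_step(self, heads: torch.Tensor, lr: float = 1e-3) -> None:
        """Hebbian-style update: strengthen weights in proportion to the
        correlation of pre- and post-synaptic activity, then renormalize
        rows as a crude stand-in for the constrained optimization."""
        b, s, h, d = heads.shape
        x = heads.reshape(b * s, h * d)        # pre-synaptic activity
        y = x @ self.mix.weight.t()            # post-synaptic activity
        delta = y.t() @ x / x.shape[0]         # Hebbian outer-product term
        self.mix.weight.add_(lr * delta)
        self.mix.weight.div_(
            self.mix.weight.norm(dim=1, keepdim=True).clamp_min(1e-8)
        )


def head_redundancy(heads: torch.Tensor) -> torch.Tensor:
    """Mean absolute pairwise cosine similarity between head outputs.
    Lower values indicate more diverse heads; high-similarity heads are
    candidates for pruning."""
    b, s, h, d = heads.shape
    flat = heads.permute(2, 0, 1, 3).reshape(h, -1)
    flat = flat / flat.norm(dim=1, keepdim=True).clamp_min(1e-8)
    sim = flat @ flat.t()
    off_diag = sim - torch.eye(h, device=sim.device)
    return off_diag.abs().sum() / (h * (h - 1))
```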