Abstract

The Transformer architecture has recently achieved state-of-the-art performance on many natural language processing tasks. A key component of the Transformer is the attention layer, which captures the relations between tokens. In this paper, we show that the weights of the attention layer have a scale-invariance property, i.e., the output is invariant to a rescaling of the weights. However, optimization algorithms that operate in the weight vector space, such as SGD, are not scale-invariant. This mismatch can hurt the optimization process. To resolve it, we seek a new parameter space for the attention layer that is both scale-invariant and sufficient to represent the output of the attention layer, so that optimization can be performed directly in this scale-invariant space. To this end, we first show that the output of the attention layer can be represented using scale-invariant variables, called paths. We then define basis paths, an independent subset of paths that is sufficient to represent all other paths, and prove that the Scale-Invariant (SI) space of the attention layer is composed of the basis paths. Finally, we design an Attention Basis Path Identification (ABPI) method to identify the basis paths and propose optimizing the attention layer directly in its SI space. Experiments on benchmark datasets show that optimizing the attention layer directly in its SI space yields more effective neural networks with attention layers.
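To illustrate the kind of rescaling invariance the abstract refers to, below is a minimal numpy sketch of one such rescaling in a single-head scaled dot-product attention layer: multiplying the query projection by a positive constant c and dividing the key projection by the same c leaves the output unchanged, because the product QK^T is preserved. This is an illustrative example under our own assumptions, not the paper's general formulation of paths; all names (attention, W_Q, W_K, W_V, c) are ours.

```python
import numpy as np

def attention(X, W_Q, W_K, W_V):
    """Single-head scaled dot-product attention (no masking, no output projection)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(W_Q.shape[1])
    # numerically stable softmax over each row of the score matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 16, 8
X = rng.standard_normal((n_tokens, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_head)) for _ in range(3))

c = 3.7  # arbitrary positive rescaling factor
out_original = attention(X, W_Q, W_K, W_V)
out_rescaled = attention(X, c * W_Q, W_K / c, W_V)  # scale W_Q up, W_K down by the same factor

print(np.allclose(out_original, out_rescaled))  # True: the output is invariant to this rescaling
```

The same argument applies per head in multi-head attention, since each head computes its own QK^T product; an optimizer such as SGD, however, takes different steps from the original and rescaled weights, which is the mismatch the paper addresses.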
