Abstract

Recently, the Transformer architecture has achieved state-of-the-art performance in many natural language processing tasks. One key component of the Transformer architecture is the attention layer, which captures relations between tokens. In this paper, we show that the weights of the attention layer have a scale-invariance property, i.e., the output is invariant to a rescaling of the weights. However, optimization algorithms that operate in the weight space, such as SGD, are not scale-invariant, and this mismatch can hurt the optimization process. To resolve it, we seek a new parameter space for the attention layer that is both scale-invariant and sufficient to represent the output of the attention layer, so that optimization can be carried out directly in this scale-invariant parameter space. To this end, we first show that the output of the attention layer can be represented using scale-invariant variables, which we call paths. We then define basis paths, an independent subset of paths that suffices to represent all other paths, and prove that the Scale-Invariant (SI) space of the attention layer is composed of the basis paths. Finally, we design an Attention Basis Path Identification (ABPI) method to identify the basis paths and propose optimizing the attention layer directly in its SI space. Experiments on benchmark datasets show that optimizing the attention layer directly in its SI space yields more effective neural networks with attention layers.
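
To make the scale-invariance claim concrete, the following is a minimal NumPy sketch, assuming the rescaling acts as W_Q -> c*W_Q, W_K -> W_K/c (so that the product W_Q W_K^T, and hence the attention output, is unchanged); the paper's exact rescaling and parameterization may differ.

```python
import numpy as np

def attention(x, W_Q, W_K, W_V):
    """Single-head dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Numerically stable row-wise softmax over the attention scores.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 4, 8
x = rng.standard_normal((n, d))
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))

c = 3.7  # arbitrary positive rescaling factor
out = attention(x, W_Q, W_K, W_V)
out_rescaled = attention(x, c * W_Q, W_K / c, W_V)  # W_Q W_K^T is unchanged

print(np.allclose(out, out_rescaled))  # True: the output is invariant to this rescaling
```

Standard optimizers such as SGD are not invariant to this reparameterization: the gradients with respect to c*W_Q and W_K/c differ from those with respect to W_Q and W_K, even though all of these weight settings produce identical outputs. This is the mismatch the SI-space formulation is designed to remove.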
