Abstract

Self-attention is a feature-processing mechanism for structured data in deep learning models. It has been widely used in transformer-based models and has demonstrated superior performance in fields such as machine translation, speech recognition, text-to-text conversion, and computer vision. Self-attention operates mainly on the surface structure of structured data, but in the deeper structure it also involves attention between basic data units and the self-attention of each basic data unit. In this paper, we investigate the forward attention flow and the backward gradient flow in the self-attention module of the Transformer model, based on the sequence-to-sequence data structure used in machine translation tasks. We find that this combination produces a “gradient distortion” phenomenon at the token level of the basic data units. We consider this phenomenon a defect and propose a series of solutions to address it theoretically. We then conduct experiments and select the most robust solution as the Unevenness-Reduced Self-Attention (URSA) module, which replaces the original self-attention module. The experimental results demonstrate that the “gradient distortion” phenomenon exists both theoretically and numerically, and that the URSA module enables the self-attention mechanism to achieve consistent, stable, and effective optimization across different models, tasks, corpora, and evaluation metrics. The URSA module is simple and highly portable.
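
As a point of reference for the forward attention flow and backward gradient flow discussed above, the sketch below implements standard scaled dot-product self-attention in PyTorch and inspects the per-token gradient norms after backpropagation. It is only an illustration of the baseline mechanism the paper analyzes, not the paper's code or the URSA module; the tensor shapes, the toy loss, and the single-head, unbatched layout are assumptions made for brevity.

```python
# Minimal sketch (not the paper's code): scaled dot-product self-attention,
# used to illustrate the forward attention flow and the backward gradient
# flow that the paper studies at the token level.
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, d_model = 5, 8                                  # assumed toy sizes
x = torch.randn(seq_len, d_model, requires_grad=True)    # token embeddings

W_q = torch.randn(d_model, d_model)
W_k = torch.randn(d_model, d_model)
W_v = torch.randn(d_model, d_model)

# Forward attention flow: every token attends to every token in the sequence.
q, k, v = x @ W_q, x @ W_k, x @ W_v
scores = q @ k.T / d_model ** 0.5        # (seq_len, seq_len) attention logits
attn = F.softmax(scores, dim=-1)         # row-stochastic attention weights
out = attn @ v                           # attended token representations

# Backward gradient flow: a scalar loss sends gradients back through the
# softmax and the Q/K/V projections to each input token.
out.sum().backward()                     # toy loss, assumption for illustration
per_token_grad_norm = x.grad.norm(dim=-1)   # one gradient norm per token
print(per_token_grad_norm)               # unevenness across tokens is the kind of
                                          # token-level effect the paper examines
```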
