Abstract

Self-attention is the core mechanism of the transformer, an architecture that enables models to understand sentences and longer texts. Transformer research is advancing rapidly, but the internal workings of these models remain poorly understood. In this work, the self-attention weights of several transformers are visualized and examined. From these observations, five types of self-attention connection are identified, and the corresponding heads are classified as Parallel, Radioactive, Homogeneous, X-type, and Compound self-attention heads. The Parallel self-attention head is the most important, and the combination of the different types affects the transformer's performance. The visualizations indicate where each type is located in the model. The results suggest that when some Homogeneous heads are made more varied, the model performs better. A new training method, the local head training method, is proposed, and it may be useful when training transformers. The purpose of this study is to lay a foundation for model biology, to offer additional perspectives for understanding transformers, and to refine training methods.
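As a concrete illustration of the kind of visualization described above, the sketch below shows one common way to extract and plot per-head self-attention weights. It is not the authors' code; it assumes a pretrained encoder from the Hugging Face `transformers` library (here `bert-base-uncased` as an arbitrary example) and `matplotlib` for plotting.

```python
# Minimal sketch: extract and plot one self-attention head's weight matrix.
# Assumptions (not from the paper): Hugging Face transformers, bert-base-uncased.
import torch
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoModel

model_name = "bert-base-uncased"  # any encoder with attention outputs works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)
model.eval()

sentence = "Self-attention lets every token attend to every other token."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
layer, head = 0, 0                          # pick one head to inspect
attn = outputs.attentions[layer][0, head]   # (seq_len, seq_len) attention weights

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
plt.imshow(attn, cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.xlabel("attended-to token")
plt.ylabel("attending token")
plt.title(f"Layer {layer}, head {head} self-attention")
plt.tight_layout()
plt.show()
```

Plotting such heatmaps for every layer and head is one way the connection patterns (parallel, radial, homogeneous, X-shaped, or compound) described in the abstract could be inspected visually.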
