Abstract

Transformers were first used for natural language processing (NLP) tasks, but they quickly spread to other deep learning fields, including computer vision. At their core is attention, a component that assesses the interdependence of pairs of tokens and dynamically highlights the relevant features of the input data (words in the case of text strings, image patches in the case of Vision Transformers). Its computational cost grows quadratically with the number of tokens. The most common Transformer architecture for image classification uses only the Transformer Encoder to transform the input tokens; however, the decoder component of the traditional Transformer architecture is also used in a variety of other applications. In this section, we first introduce the Attention Mechanism (Section 1), followed by the Basic Transformer Block, which includes the Vision Transformer (Section 2).
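
As a concrete illustration of the mechanism summarised above, the sketch below implements standard scaled dot-product attention in plain NumPy. The helper name, array shapes, and toy token count are illustrative assumptions rather than part of the text; the (n_tokens x n_tokens) score matrix the code builds is what makes the cost grow quadratically with the number of tokens.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        # Q, K: (n_tokens, d_k); V: (n_tokens, d_v).
        # Pairwise scores form an (n_tokens, n_tokens) matrix,
        # hence the quadratic cost in the number of tokens.
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)
        # Row-wise softmax turns scores into attention weights.
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        # Each output token is a weighted sum of the value vectors.
        return weights @ V

    # Self-attention on 4 hypothetical tokens with 8-dimensional embeddings.
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(4, 8))
    out = scaled_dot_product_attention(tokens, tokens, tokens)
    print(out.shape)  # (4, 8)

In a Vision Transformer, the tokens would be linear projections of image patches rather than word embeddings, but the attention computation itself is unchanged.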
