Abstract

Transformers, the dominant architecture in natural language processing, have also recently attracted much attention from computational visual media researchers due to their capacity for long-range representation and their high performance. Transformers are sequence-to-sequence models that use a self-attention mechanism rather than the sequential structure of RNNs; as a result, they can be trained in parallel and can represent global information. This study comprehensively surveys recent visual transformer works. We categorize them by task scenario: backbone design, high-level vision, low-level vision and generation, and multimodal learning, and analyze their key ideas. Unlike previous surveys, we focus mainly on visual transformer methods for low-level vision and generation, and also review the latest works on backbone design in detail. For ease of understanding, the main contributions of the latest works are summarized in tables. In addition to quantitative comparisons, we present image results for low-level vision and generation tasks. Computational costs and source code links for important works are also provided to assist further development.
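
To make the self-attention mechanism mentioned above concrete, the following is a minimal NumPy sketch of scaled dot-product self-attention over a sequence of patch tokens. The shapes, weight initialization, and function name are illustrative assumptions rather than details of any surveyed model; the point is that every token attends to every other token through matrix products, which is why transformers process a sequence in parallel and capture global context.

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        # X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: (d_model, d_k) projections.
        Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # queries, keys, values for all tokens at once
        scores = Q @ K.T / np.sqrt(K.shape[-1])           # pairwise similarities, scaled by sqrt(d_k)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys: each row sums to 1
        return weights @ V                                # each output token is a global weighted mixture

    # Example: 16 patch tokens embedded in 64 dimensions (hypothetical sizes)
    rng = np.random.default_rng(0)
    X = rng.standard_normal((16, 64))
    Wq, Wk, Wv = (0.1 * rng.standard_normal((64, 64)) for _ in range(3))
    out = self_attention(X, Wq, Wk, Wv)                   # shape (16, 64)

Unlike an RNN, no step depends on the previous step's output, so the whole computation reduces to a batch of matrix multiplications.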

Highlights

  • Convolutional neural networks (CNNs) [1,2,3] have become the fundamental architecture in computational visual media (CVM)

  • Researchers began to incorporate self-attention mechanisms into CNNs to model long-range relationships, because convolutional kernels are inherently local [4,5,6,7,8]

  • Experimental results indicate that ViT models can match or even outperform state-of-the-art CNN architectures such as RegNet [3] and EfficientNet [2], which rely on expert-designed basic modules and neural architecture search (NAS) techniques (see the patch-tokenization sketch below)
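
As a concrete illustration of how a ViT-style model consumes an image, the sketch below shows the patch-tokenization step applied before the stacked self-attention layers. The 16x16 patch size and helper name are assumptions chosen to match common ViT configurations, not code from the surveyed works.

    import numpy as np

    def patchify(image, patch_size=16):
        # Split an (H, W, C) image into non-overlapping patches and flatten each
        # patch into one token of length patch_size * patch_size * C.
        H, W, C = image.shape
        p = patch_size
        patches = image.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)
        return patches.reshape(-1, p * p * C)

    tokens = patchify(np.zeros((224, 224, 3)))   # (196, 768): 14 x 14 patch tokens

Each token is then linearly embedded and processed by self-attention, so the receptive field is global from the first layer onward.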

Summary

Introduction

Convolutional neural networks (CNNs) [1,2,3] have become the fundamental architecture in computational visual media (CVM). Given the rapid development of visual transformer backbones, this survey focuses on the latest works in that area, as well as on low-level vision tasks. The study is organized around four fields: backbone design, high-level vision (e.g., object detection and semantic segmentation), low-level vision and generation, and multimodal learning. Several recent works are introduced from two perspectives: (i) injecting convolutional prior knowledge into ViT (sketched below), and (ii) boosting the richness of visual features. We also review recent representative vision-plus-language (V+L) models and summarize the pretraining objectives used in this field.
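
The sketch below illustrates aspect (i) in the spirit of methods such as LocalViT: a transformer feed-forward layer augmented with a depthwise convolution applied to the tokens reshaped back onto their 2D grid, so local (convolutional) structure is mixed into an otherwise global architecture. The class name, layer sizes, and activation are illustrative assumptions, not the exact configuration of any surveyed model.

    import torch
    import torch.nn as nn

    class LocallyEnhancedFFN(nn.Module):
        # Feed-forward block with a depthwise 3x3 convolution inserted between the two
        # linear layers, injecting a local/convolutional prior into a transformer block.
        def __init__(self, dim=96, hidden=384):
            super().__init__()
            self.fc1 = nn.Linear(dim, hidden)
            self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)  # depthwise
            self.fc2 = nn.Linear(hidden, dim)
            self.act = nn.GELU()

        def forward(self, x, h, w):                      # x: (B, h*w, dim) patch tokens
            x = self.act(self.fc1(x))                    # (B, h*w, hidden)
            b, n, c = x.shape
            x = x.transpose(1, 2).reshape(b, c, h, w)    # tokens back onto the 2D grid
            x = self.act(self.dwconv(x))                 # local mixing with a convolution
            x = x.reshape(b, c, n).transpose(1, 2)       # back to a token sequence
            return self.fc2(x)

    out = LocallyEnhancedFFN()(torch.randn(2, 14 * 14, 96), 14, 14)   # (2, 196, 96)

In a full model, such a block would replace the plain feed-forward sublayer inside each transformer block, leaving the self-attention sublayer unchanged.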

Contents

  • Visual transformers
      • Backbone design: T2T-ViT, Swin Transformer, DeepViT, LocalViT; method details, comparison on ImageNet, and visualization of ViT
      • High-level vision: Deformable DETR, UP-DETR
      • Low-level vision and generation: Uformer, TransGAN, ColTran, GANsformer, StyTr2
      • Multimodal learning: UNITER, SemVLP; pretraining objectives (masked language modeling, image–text matching, masked region modeling); comparisons and implementation details