Abstract

Transformers, the dominant architecture in natural language processing, have also recently attracted much attention from computational visual media researchers due to their capacity for long-range representation and their high performance. Transformers are sequence-to-sequence models that use a self-attention mechanism rather than the sequential structure of RNNs; as a result, they can be trained in parallel and can represent global information. This study comprehensively surveys recent visual transformer works. We categorize them by task scenario: backbone design, high-level vision, low-level vision and generation, and multimodal learning, and analyze their key ideas. Unlike previous surveys, we focus mainly on visual transformer methods for low-level vision and generation, and also review the latest works on backbone design in detail. For ease of understanding, the main contributions of the latest works are summarized in tables. In addition to quantitative comparisons, we present image results for low-level vision and generation tasks. Computational costs and source code links for important works are also provided to assist further development.
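
To make the self-attention mechanism mentioned above concrete, the following is a minimal NumPy sketch of scaled dot-product self-attention over a sequence of patch tokens. The shapes, weight initialization, and function name are illustrative assumptions rather than details of any surveyed model; the point is that every token attends to every other token through matrix products, which is why transformers process a sequence in parallel and capture global context.

    import numpy as np

    def self_attention(X, Wq, Wk, Wv):
        # X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: (d_model, d_k) projections.
        Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # queries, keys, values for all tokens at once
        scores = Q @ K.T / np.sqrt(K.shape[-1])           # pairwise similarities, scaled by sqrt(d_k)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)    # softmax over keys: each row sums to 1
        return weights @ V                                # each output token is a global weighted mixture

    # Example: 16 patch tokens embedded in 64 dimensions (hypothetical sizes)
    rng = np.random.default_rng(0)
    X = rng.standard_normal((16, 64))
    Wq, Wk, Wv = (0.1 * rng.standard_normal((64, 64)) for _ in range(3))
    out = self_attention(X, Wq, Wk, Wv)                   # shape (16, 64)

Unlike an RNN, no step depends on the previous step's output, so the whole computation reduces to a batch of matrix multiplications.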

Highlights

  • Convolutional neural networks (CNNs) [1,2,3] have become the fundamental architecture in computational visual media (CVM)

  • Researchers began to incorporate self-attention mechanisms into CNNs to model long-range relationships, because convolutional kernels are inherently local [4,5,6,7,8]

  • Experimental results indicate that ViT models can match or even outperform state-of-the-art CNN architectures such as RegNet [3] and EfficientNet [2], which rely on expert-designed basic modules and neural architecture search (NAS) techniques (see the patch-tokenization sketch below)
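
As a concrete illustration of how a ViT-style model consumes an image, the sketch below shows the patch-tokenization step applied before the stacked self-attention layers. The 16x16 patch size and helper name are assumptions chosen to match common ViT configurations, not code from the surveyed works.

    import numpy as np

    def patchify(image, patch_size=16):
        # Split an (H, W, C) image into non-overlapping patches and flatten each
        # patch into one token of length patch_size * patch_size * C.
        H, W, C = image.shape
        p = patch_size
        patches = image.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)
        return patches.reshape(-1, p * p * C)

    tokens = patchify(np.zeros((224, 224, 3)))   # (196, 768): 14 x 14 patch tokens

Each token is then linearly embedded and processed by self-attention, so the receptive field is global from the first layer onward.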

Summary

Introduction

Convolutional neural networks (CNNs) [1,2,3] have become the fundamental architecture in computational visual media (CVM). Given the rapid development of visual transformer backbones, this survey focuses on the latest works in that area, as well as on low-level vision tasks. The study is organized around four fields: backbone design, high-level vision (e.g., object detection and semantic segmentation), low-level vision and generation, and multimodal learning. Several recent works are introduced from two perspectives: (i) injecting convolutional prior knowledge into ViT (sketched below), and (ii) boosting the richness of visual features. We also review recent representative vision-plus-language (V+L) models and summarize the pretraining objectives used in this field.
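
The sketch below illustrates aspect (i) in the spirit of methods such as LocalViT: a transformer feed-forward layer augmented with a depthwise convolution applied to the tokens reshaped back onto their 2D grid, so local (convolutional) structure is mixed into an otherwise global architecture. The class name, layer sizes, and activation are illustrative assumptions, not the exact configuration of any surveyed model.

    import torch
    import torch.nn as nn

    class LocallyEnhancedFFN(nn.Module):
        # Feed-forward block with a depthwise 3x3 convolution inserted between the two
        # linear layers, injecting a local/convolutional prior into a transformer block.
        def __init__(self, dim=96, hidden=384):
            super().__init__()
            self.fc1 = nn.Linear(dim, hidden)
            self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)  # depthwise
            self.fc2 = nn.Linear(hidden, dim)
            self.act = nn.GELU()

        def forward(self, x, h, w):                      # x: (B, h*w, dim) patch tokens
            x = self.act(self.fc1(x))                    # (B, h*w, hidden)
            b, n, c = x.shape
            x = x.transpose(1, 2).reshape(b, c, h, w)    # tokens back onto the 2D grid
            x = self.act(self.dwconv(x))                 # local mixing with a convolution
            x = x.reshape(b, c, n).transpose(1, 2)       # back to a token sequence
            return self.fc2(x)

    out = LocallyEnhancedFFN()(torch.randn(2, 14 * 14, 96), 14, 14)   # (2, 196, 96)

In a full model, such a block would replace the plain feed-forward sublayer inside each transformer block, leaving the self-attention sublayer unchanged.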

Contents

  • Visual transformers
      • Backbone design: T2T-ViT, Swin Transformer, DeepViT, LocalViT; method details, comparison on ImageNet, and visualization of ViT
      • High-level vision: Deformable DETR, UP-DETR
      • Low-level vision and generation: Uformer, TransGAN, ColTran, GANsformer, StyTr2
      • Multimodal learning: UNITER, SemVLP; pretraining objectives (masked language modeling, image–text matching, masked region modeling); comparisons and implementation details