TransVOD: End-to-End Video Object Detection With Spatial-Temporal Transformers.

Qianyu Zhou, Yibo Yang, Dacheng Tao, Yunhai Tong, Guangliang Cheng, Lizhuang Ma, Xiangtai Li, Lu He

doi:10.1109/tpami.2022.3223955

Abstract

Detection Transformer (DETR) and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while performing as well as previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, the first end-to-end video object detection system based on simple yet effective spatial-temporal Transformer architectures. The first goal of this paper is to streamline the current VOD pipeline, removing the need for many hand-crafted feature aggregation components such as optical flow models and relation networks. In addition, thanks to the object query design in DETR, our method does not need post-processing methods such as Seq-NMS. In particular, we present a temporal Transformer that aggregates both the spatial object queries and the feature memories of each frame. Our temporal Transformer consists of two components: a Temporal Query Encoder (TQE) to fuse object queries, and a Temporal Deformable Transformer Decoder (TDTD) to obtain the detection results for the current frame. These designs improve the strong Deformable DETR baseline by a significant margin (3%-4% mAP) on the ImageNet VID dataset. TransVOD yields comparable performance on the ImageNet VID benchmark. We then present two improved versions of TransVOD: TransVOD++ and TransVOD Lite. The former fuses object-level information into the object queries via dynamic convolution, while the latter models entire video clips as the output to speed up inference. We give a detailed analysis of all three models in the experiments. In particular, our proposed TransVOD++ sets a new state-of-the-art record in accuracy on ImageNet VID with 90.0% mAP. Our proposed TransVOD Lite also achieves the best speed and accuracy trade-off, with 83.7% mAP while running at around 30 FPS on a single V100 GPU.
Code and models are available at https://github.com/SJTU-LuHe/TransVOD.
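
As a rough illustration of what "fusing object queries" across frames can mean, the sketch below cross-attends the current frame's object queries over queries from reference frames and adds the result back as a residual. It is a minimal, assumption-laden toy and not the authors' TQE: it has no learned projections, no multiple heads, and no deformable attention, and the names fuse_queries, current_queries, and reference_queries are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_queries(current_queries, reference_queries):
    """Toy cross-attention: each current-frame query attends over the
    reference-frame queries; the attended context is added as a residual.

    Assumption: plain scaled dot-product attention with identity
    projections, chosen only to illustrate the idea of query fusion.
    """
    dim = len(current_queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in current_queries:
        # Similarity between this query and every reference query.
        scores = [
            scale * sum(a * b for a, b in zip(query, ref))
            for ref in reference_queries
        ]
        weights = softmax(scores)
        # Attention-weighted average of the reference queries.
        context = [
            sum(w * ref[i] for w, ref in zip(weights, reference_queries))
            for i in range(dim)
        ]
        # Residual connection keeps the current-frame information.
        fused.append([q + c for q, c in zip(query, context)])
    return fused


# Two current-frame queries fused with three reference-frame queries.
current = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse_queries(current, reference)
```

The output keeps one fused vector per current-frame query, so a downstream decoder would still see the same number of queries it started with.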

Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Publication Date: Jun 1, 2023
Citations: 52

Similar Papers

End-to-End Video Object Detection with Spatial-Temporal Transformers
Lu He ... Lizhuang Ma
17 Oct 2021

Single Shot Video Object Detector
Jiajun Deng ... Ting Yao
IEEE Transactions on Multimedia | VOL. 23
01 May 2020

Object Detection in Video with Spatiotemporal Sampling Networks
Gedas Bertasius ... Jianbo Shi
01 Jan 2018

Improved Feature Extraction and Similarity Algorithm for Video Object Detection
Haotian You ... Yufang Lu
Information | VOL. 14
12 Feb 2023
