STFormer: Spatial-Temporal-Aware Transformer for Video Instance Segmentation.

Hao Li, Wei Wang, Mengzhu Wang, Huibin Tan, Long Lan, Zhigang Luo, Xinwang Liu, Kenli Li

doi:10.1109/tnnls.2024.3455551

Abstract

Video instance segmentation (VIS) is a challenging task that requires classifying, segmenting, and tracking objects in videos. Existing Transformer-based VIS approaches have shown remarkable success, combining encoded features and instance queries as decoder inputs. However, their decoder inputs are low-resolution due to computational cost, resulting in a loss of fine-grained information, sensitivity to background interference, and poor handling of small objects. Moreover, the queries are randomly initialized without location information, which hinders both convergence efficiency and accurate localization of object instances. To address these issues, we propose a novel VIS approach, STFormer, with a spatial-temporal feature aggregation (STFA) module and a spatial-temporal-aware Transformer (STT). Specifically, STFA efficiently obtains robust high-resolution masked features for the decoder, while STT's location-guided instance query (LGIQ) improves the initial instance queries. STFormer preserves more fine-grained information, improves convergence efficiency, and localizes object instance features accurately. Extensive experiments on the YouTube-VIS 2019, YouTube-VIS 2021, and OVIS datasets show that STFormer outperforms mainstream VIS methods.

Similar Papers

BURST: A Benchmark for Unifying Object Recognition, Segmentation and Tracking in Video
Ali Athar ... Deva Ramanan
01 Jan 2023

Object Segmentation in Video Sequences by using Single Frame Processing
Muhammad Hamza Bhatti ... Haseeb Younis
01 Dec 2019

Video Object Segmentation and Tracking Framework With Improved Threshold Decision and Diffusion Distance
Shao-Yi Chien ... Wei-Kai Chan
IEEE Transactions on Circuits and Systems for Video Technology | VOL. 23
01 Jun 2013

Moving object segmentation and tracking in video
Chun-Ming Li ... Qiu-Ming Li
01 Jan 2004
