Abstract

In Semi-supervised Video Object Segmentation (SVOS), research has concentrated on improving the memory and readout mechanisms used for frame matching, particularly with respect to temporal dynamics. Current methods predominantly encode video frames with 2D CNNs, which overlooks temporal variation across individual frames and their associated masks during encoding. One potential solution is to replace 2D CNNs with temporal models such as 3D CNNs, but this significantly increases computational cost, making it impractical for real-world SVOS applications. In this paper, we introduce Grouped Temporal Recalibration with Attention for Convolutional Encoders (G-TRACE), a plug-and-play module compatible with various existing SVOS frameworks. G-TRACE employs hierarchical memory-centric attention and integrates seamlessly with 2D CNNs, providing a form of temporal modeling that is orthogonal to conventional frame-matching methods. Extensive evaluations on four widely used benchmarks demonstrate that our method consistently delivers significant performance improvements over a range of baseline models.
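
The abstract does not specify G-TRACE's internal design, so the following is only a minimal sketch of what a plug-and-play temporal recalibration module for a 2D CNN encoder could look like. The class name `GroupedTemporalRecalibration`, the grouping scheme, the memory size, and the gating form are illustrative assumptions, not the authors' architecture.

```python
# Hypothetical sketch: recalibrate per-frame 2D CNN features using grouped
# channel attention computed against a small memory of earlier-frame statistics.
# All design choices below are assumptions made for illustration.
import torch
import torch.nn as nn


class GroupedTemporalRecalibration(nn.Module):
    def __init__(self, channels: int, groups: int = 8, memory_size: int = 4):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        self.memory_size = memory_size
        # Per-group gating MLP: pooled current + memory statistics -> channel gate.
        self.gate = nn.Sequential(
            nn.Linear(2 * channels // groups, channels // groups),
            nn.ReLU(inplace=True),
            nn.Linear(channels // groups, channels // groups),
            nn.Sigmoid(),
        )
        self.memory: list[torch.Tensor] = []  # pooled features of past frames

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) features of the current frame from a 2D CNN stage.
        b, c, _, _ = feat.shape
        pooled = feat.mean(dim=(2, 3))                      # (B, C)
        if self.memory:
            mem = torch.stack(self.memory, dim=0).mean(0)   # (B, C) temporal context
        else:
            mem = torch.zeros_like(pooled)
        # Split channels into groups and compute a per-group channel gate.
        cur_g = pooled.view(b, self.groups, -1)             # (B, G, C/G)
        mem_g = mem.view(b, self.groups, -1)
        gate = self.gate(torch.cat([cur_g, mem_g], dim=-1)) # (B, G, C/G)
        gate = gate.view(b, c, 1, 1)
        # Update the rolling memory with the current frame's pooled statistics.
        self.memory.append(pooled.detach())
        if len(self.memory) > self.memory_size:
            self.memory.pop(0)
        return feat * gate                                  # temporally recalibrated features
```

Because a module of this kind only rescales existing channels, it could in principle be inserted after any convolutional stage of a memory-based SVOS encoder without altering the downstream matching and readout pipeline, which is consistent with the plug-and-play claim in the abstract.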
