Divided attention

Abstract

Researchers from Nanjing University of Information Science and Technology (NUIST) present an attention-modulating network for video object segmentation, with an advanced attention modulator that efficiently modulates a segmentation model to focus on a specific object of interest. The group employ a focal loss that distinguishes simple samples from more difficult ones, accelerating the convergence of network training, and report state-of-the-art segmentation performance.
Video object segmentation (VOS) is a fundamental task in computer vision, with important applications in video editing, robotics, and self-driving cars. VOS tasks are mainly divided into unsupervised and semi-supervised categories. The former seeks to find and segment the salient targets in a video completely without supervision, with the algorithm itself deciding which object is the main one to segment. The latter aims at segmenting an object instance throughout the entire video sequence given only the object mask on the first frame, which can be viewed as a pixel-level object tracking problem. Semi-supervised VOS can be subdivided into single-object segmentation and multi-object segmentation. In their Letter, the team focus on semi-supervised VOS.
Deep learning for VOS has gained attention in the research community in recent years. Existing semi-supervised VOS techniques work by constructing deep networks and fine-tuning a pre-trained classifier on the ground truth given in the first frame during online testing. This online fine-tuning of a classifier during testing has been shown to significantly improve accuracy.
[Figures: illustrative diagram of the proposed segmentation model and approach; segmentation results.]
The team propose an attention-modulating network for the semi-supervised VOS task.
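To make the role of the focal loss concrete, here is a minimal sketch of the standard binary focal loss applied to a single pixel. It is not the team's implementation: the function names and the values `gamma=2.0` and `alpha=0.25` are assumptions chosen for illustration, since the article does not state the settings used. The point is only that the factor `(1 - p_t) ** gamma` shrinks the loss of well-classified (simple) pixels far more than that of misclassified (difficult) ones.

```python
import math


def cross_entropy(p, y, eps=1e-7):
    """Plain binary cross-entropy for one pixel (p = predicted foreground probability)."""
    p = min(max(p, eps), 1.0 - eps)
    return -math.log(p if y == 1 else 1.0 - p)


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss for one pixel; gamma and alpha are illustrative defaults."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p               # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class-balancing weight
    # (1 - p_t) ** gamma down-weights pixels that are already classified well.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A simple foreground pixel (confident, correct) and a difficult one (mostly wrong).
easy = binary_focal_loss(0.95, 1)
hard = binary_focal_loss(0.30, 1)
print(f"focal loss, easy pixel: {easy:.6f}")
print(f"focal loss, hard pixel: {hard:.6f}")
print(f"hard/easy ratio, focal loss:    {hard / easy:.1f}")
print(f"hard/easy ratio, cross-entropy: {cross_entropy(0.30, 1) / cross_entropy(0.95, 1):.1f}")
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the ordinary cross-entropy, so the focusing effect comes entirely from `gamma`.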
Co-author Kaihua Zhang elaborates on the process: “We designed an efficient visual and spatial attention modulator based on the semantic information of the annotated object in the first frame and the spatial information of the predicted object mask in the previous frame, respectively, to quickly modulate the segmentation model to focus on the specific object of interest. Then we design a SCAM architecture, which includes a channel attention module and a spatial attention module, and inject it into the segmentation model to further refine its feature maps. In addition, we construct a feature pyramid attention module to mine context information at different scales to solve the problem of multi-scale segmentation.”
Most existing methods rely on fine-tuning models using first-frame annotations and are time-consuming, making them unsuitable for most practical applications. To address this issue, the proposed approach uses an attention-modulating network that focuses on the appearance of a specific object instance in a single feed-forward pass, without fine-tuning. Compared with other methods, it achieves state-of-the-art performance on the DAVIS2017 dataset by using attention modulators, feature pyramid attention modules and focal loss. To overcome a sample imbalance problem, the team adopt focal loss, which helps to distinguish between difficult and simple samples and can thereby accelerate the convergence of network training.
VOS remains challenging due to occlusions, fast motion, deformation, and significant appearance variations over time. The method uses a visual attention modulator to extract semantic information such as category, color and shape from the first frame. The spatial attention modulator takes the predicted location of the object mask in the previous frame as a spatial prior to guide the segmentation network to focus on the regions where the target is most likely to appear in the current frame.
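One simple way to turn a previous-frame mask into a spatial prior is to fit a 2D Gaussian to the mask pixels and use the resulting heat map to weight locations in the current frame. The sketch below illustrates that idea only; the article does not describe the exact form of the prior, so the function name `spatial_prior` and the parameters `scale` and `min_sigma` are hypothetical.

```python
import math


def spatial_prior(mask, scale=1.0, min_sigma=1.0):
    """Gaussian heat map fitted to a binary mask given as a list of rows of 0/1.

    `scale` widens the Gaussian and `min_sigma` keeps it from collapsing on
    tiny objects; both are illustrative assumptions.
    """
    height, width = len(mask), len(mask[0])
    points = [(r, c) for r in range(height) for c in range(width) if mask[r][c]]
    if not points:
        # No object predicted in the previous frame: fall back to a uniform prior.
        return [[1.0] * width for _ in range(height)]

    n = len(points)
    mean_r = sum(r for r, _ in points) / n
    mean_c = sum(c for _, c in points) / n
    sigma_r = max(min_sigma, scale * math.sqrt(sum((r - mean_r) ** 2 for r, _ in points) / n))
    sigma_c = max(min_sigma, scale * math.sqrt(sum((c - mean_c) ** 2 for _, c in points) / n))

    # Values are close to 1 near the previous object location and decay with distance.
    return [
        [
            math.exp(-0.5 * (((r - mean_r) / sigma_r) ** 2 + ((c - mean_c) / sigma_c) ** 2))
            for c in range(width)
        ]
        for r in range(height)
    ]


prev_mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
prior = spatial_prior(prev_mask)
for row in prior:
    print(" ".join(f"{value:.2f}" for value in row))
```

In a network, a map like this would typically be resized to each feature map and used to scale or shift activations, so that regions far from the previous object location contribute less.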
To handle objects at multiple scales, feature pyramid attention modules mine context information at different scales, achieving better pixel-level attention for the high-level feature maps. The proposed VOS approach is fast, which facilitates many applications, such as interactive video editing and augmented reality. It may be applied to video understanding models in the short term and, after longer-term development, to robotics and self-driving cars.
Kaihua Zhang comments on his group's future work: “Experiments show that our algorithm produces erroneous instance segmentation when similar objects occlude each other. To tackle this problem, we will leverage a position-sensitive embedding which is capable of distinguishing the pixels of similar objects. We have also found that solving VOS with multiple instances requires template matching to deal with occlusion and temporal propagation to ensure temporal continuity; otherwise the segmentation instance would be lost. Thus, we will use the re-identification module to retrieve lost instances and take its frame as the starting point and use the mask propagation module to bi-directionally recover the lost instances.”
Over the next decade, the development of VOS will bring higher precision while meeting real-time application requirements. At present, manual pixel-level annotation of VOS data sets is too expensive, so cheaper large-scale VOS data sets are expected in the future.

Similar Papers
  • Research Article
  • Cite Count: 2
  • 10.1145/3611389
Semi-supervised Video Object Segmentation Via an Edge Attention Gated Graph Convolutional Network
  • Sep 18, 2023
  • ACM Transactions on Multimedia Computing, Communications, and Applications
  • Yuqing Zhang + 4 more

Video object segmentation (VOS) must cope with heavy occlusions, large deformation, and severe motion blur. While many remarkable convolutional neural networks are devoted to the VOS task, they often misidentify background noise as the target or output coarse object boundaries, because they fail to mine detail information and high-order correlations between pixels across the whole video. In this work, we propose an edge attention gated graph convolutional network (GCN) for VOS. The seed point initialization and graph construction stages construct a spatio-temporal graph of the video by exploring the spatial intra-frame correlation and the temporal inter-frame correlation of superpixels. The node classification stage identifies foreground superpixels by using an edge attention gated GCN which mines higher-order correlations between superpixels and propagates features among different nodes. The segmentation optimization stage optimizes the classification of foreground superpixels and reduces segmentation errors by using a global appearance model which captures the long-term stable features of objects. In summary, the key contribution of our framework is twofold: (a) the spatio-temporal graph representation can propagate the seed points of the first frame to subsequent frames, which suits our framework to the semi-supervised VOS task; and (b) the edge attention gated GCN can learn the importance of each node with respect to both the neighboring nodes and the whole task with a small number of layers. Experiments on the DAVIS 2016 and DAVIS 2017 datasets show that our framework achieves excellent performance with only a small number of training samples (45 video sequences).

  • Research Article
  • Cite Count: 31
  • 10.3390/s21237949
One Spatio-Temporal Sharpening Attention Mechanism for Light-Weight YOLO Models Based on Sharpening Spatial Attention
  • Nov 28, 2021
  • Sensors (Basel, Switzerland)
  • Mengfan Xue + 4 more

Attention mechanisms have demonstrated great potential in improving the performance of deep convolutional neural networks (CNNs). However, many existing methods are dedicated to developing channel or spatial attention modules with many parameters, and complex attention modules inevitably affect the performance of CNNs. In our experiments embedding the Convolutional Block Attention Module (CBAM) in the light-weight model YOLOv5s, CBAM does affect speed and increase model complexity while reducing average precision, but Squeeze-and-Excitation (SE), as part of CBAM, has a positive impact on the model. To replace the spatial attention module in CBAM and offer a suitable scheme of channel and spatial attention modules, this paper proposes a Spatio-temporal Sharpening Attention Mechanism (SSAM), which sequentially infers intermediate maps along a channel attention module and a Sharpening Spatial Attention (SSA) module. By introducing a sharpening filter into the spatial attention module, we obtain an SSA module with low complexity. We look for a scheme that combines our SSA module with the SE module or the Efficient Channel Attention (ECA) module and gives the best improvement in models such as YOLOv5s and YOLOv3-tiny. We therefore perform various replacement experiments and offer one best scheme: embedding channel attention modules in the backbone and neck of the model and integrating SSAM into the YOLO head. We verify the positive effect of our SSAM on two general object detection datasets, VOC2012 and MS COCO2017: one for obtaining a suitable scheme and the other for proving the versatility of our method in complex scenes. Experimental results on the two datasets show a clear improvement in average precision and detection performance, which demonstrates the usefulness of our SSAM in light-weight YOLO models. Furthermore, visualization results show that our SSAM enhances positioning ability.

  • Conference Article
  • Cite Count: 1
  • 10.1109/icspcc46631.2019.8960816
Semi-supervised Video Object Segmentation with Recurrent Neural Network
  • Sep 1, 2019
  • Xuanguang Ren + 3 more

Object segmentation in videos has been extensively investigated in recent years. However, semi-supervised object segmentation in videos is still a challenging research topic, as it is hard to model temporal information. Most research treats video frames independently and loses the relationship between adjacent frames. To overcome this limitation, Semi-supervised Video Object Segmentation with Recurrent Neural Network (SVOSR) is proposed, which uses a convolutional gated recurrent unit (ConvGRU) to learn the temporal information between adjacent frames. The proposed method has three main parts. First, the feature extraction part generates spatial information from adjacent frames. Second, the relation part extracts temporal information from the adjacent spatial information. Third, the decoder part combines the spatiotemporal information and infers the results. We put forward the relation part and design the decoder part for better segmentation. Experiments on the DAVIS dataset show that our method achieves reasonable accuracy with an inference time an order of magnitude faster than OSVOS and other methods.

  • Research Article
  • Cite Count: 19
  • 10.1109/tpami.2022.3163375
Video Object Segmentation Using Kernelized Memory Network With Multiple Kernels.
  • Feb 1, 2023
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
  • Hongje Seong + 2 more

Semi-supervised video object segmentation (VOS) is to predict the segment of a target object in a video when a ground truth segmentation mask for the target is given in the first frame. Recently, space-time memory networks (STM) have received significant attention as a promising approach for semi-supervised VOS. However, an important point has been overlooked in applying STM to VOS: The solution (=STM) is non-local, but the problem (=VOS) is predominantly local. To solve this mismatch between STM and VOS, we propose new VOS networks called kernelized memory network (KMN) and KMN with multiple kernels (KMN M). Our networks conduct not only Query-to-Memory matching but also Memory-to-Query matching. In Memory-to-Query matching, a kernel is employed to reduce the degree of non-localness of the STM. In addition, we present a Hide-and-Seek strategy in pre-training to handle occlusions effectively. The proposed networks surpass the state-of-the-art results on standard benchmarks by a significant margin (+4% in JM on DAVIS 2017 test-dev set). The runtimes of our proposed KMN and KMN M on DAVIS 2016 validation set are 0.12 and 0.13 seconds per frame, respectively, and the two networks have similar computation times to STM.

  • Conference Article
  • 10.1117/12.2631435
Comparison on video object segmentation: methods and results
  • Mar 18, 2022
  • Yifei Liu + 3 more

Video object segmentation (VOS) is a fundamental research area in the computer vision field in which the goal is to separate the target object(s) from the background in continuous frames of a video. It has huge application value and demand in human-computer interaction, video analysis, compression, re-creation, and numerous fields. Current VOS research in CV academia is mainly classified into four main categories: semi-supervised VOS, unsupervised VOS, weakly supervised VOS, and interactive VOS. In this paper, we give an overview of the latest methods in the first three categories. We summarized problems in each area and features of different methods aiming to solve them. Then we compare these methods to find out the performance in different test environments.

  • Research Article
  • Cite Count: 4
  • 10.1016/j.neucom.2020.06.129
Weakly supervised video object segmentation initialized with referring expression
  • Sep 9, 2020
  • Neurocomputing
  • Xiaoqing Bu + 6 more

  • Book Chapter
  • Cite Count: 193
  • 10.1007/978-3-030-58542-6_38
Kernelized Memory Network for Video Object Segmentation
  • Jan 1, 2020
  • Hongje Seong + 2 more

Semi-supervised video object segmentation (VOS) is a task that involves predicting a target object in a video when the ground truth segmentation mask of the target object is given in the first frame. Recently, space-time memory networks (STM) have received significant attention as a promising solution for semi-supervised VOS. However, an important point is overlooked when applying STM to VOS. The solution (STM) is non-local, but the problem (VOS) is predominantly local. To solve the mismatch between STM and VOS, we propose a kernelized memory network (KMN). Before being trained on real videos, our KMN is pre-trained on static images, as in previous works. Unlike in previous works, we use the Hide-and-Seek strategy in pre-training to obtain the best possible results in handling occlusions and segment boundary extraction. The proposed KMN surpasses the state-of-the-art on standard benchmarks by a significant margin (+5% on DAVIS 2017 test-dev set). In addition, the runtime of KMN is 0.12 s per frame on the DAVIS 2016 validation set, and the KMN rarely requires extra computation, when compared with STM.

  • Research Article
  • Cite Count: 11
  • 10.1109/tip.2023.3321462
Prototypical Matching Networks for Video Object Segmentation.
  • Jan 1, 2023
  • IEEE Transactions on Image Processing
  • Fanchao Lin + 5 more

Semi-supervised video object segmentation is the task of segmenting the target in sequential frames given the ground truth mask in the first frame. Modern approaches usually utilize such a mask as pixel-level supervision and typically exploit pixel-to-pixel matching between the reference frame and the current frame. However, matching at the pixel level, which overlooks the high-level information beyond local areas, often suffers from confusion caused by similar local appearances. In this paper, we present Prototypical Matching Networks (PMNet) - a novel architecture that integrates prototypes into matching-based video object segmentation frameworks as high-level supervision. Specifically, PMNet first divides the foreground and background areas into several parts according to their similarity to the global prototypes. The part-level prototypes and instance-level prototypes are generated by encapsulating the semantic information of identical parts and identical instances, respectively. To model the correlation between prototypes, the prototype representations are propagated to each other by reasoning on a graph structure. Then, PMNet stores both the pixel-level features and prototypes in the memory bank as the target cues. Three affinities, i.e., pixel-to-pixel affinity, prototype-to-pixel affinity, and prototype-to-prototype affinity, are derived to measure the similarity between the query frame and the features in the memory bank. The features aggregated from the memory bank using these affinities provide powerful discrimination from both the pixel-level and prototype-level perspectives. Extensive experiments conducted on four benchmarks demonstrate results superior to state-of-the-art video object segmentation techniques.

  • Research Article
  • Cite Count: 4
  • 10.1016/j.neucom.2024.128076
Structural Transformer with Region Strip Attention for Video Object Segmentation
  • Jun 18, 2024
  • Neurocomputing
  • Qingfeng Guan + 6 more

  • Research Article
  • Cite Count: 20
  • 10.1109/tip.2018.2859622
Joint Video Object Discovery and Segmentation by Coupled Dynamic Markov Networks.
  • Jul 30, 2018
  • IEEE Transactions on Image Processing
  • Ziyi Liu + 6 more

It is a challenging task to extract the segmentation mask of a target from a single noisy video, which involves object discovery coupled with segmentation. To solve this challenge, we present a method to jointly discover and segment an object from a noisy video, where the target disappears intermittently throughout the video. Previous methods either perform only video object discovery, or perform video object segmentation presuming the existence of the object in each frame. We argue that jointly conducting the two tasks in a unified way will be beneficial. In other words, video object discovery and video object segmentation tasks can facilitate each other. To validate this hypothesis, we propose a principled probabilistic model, where two dynamic Markov networks are coupled: one for discovery and the other for segmentation. When conducting Bayesian inference on this model using belief propagation, the bi-directional message passing reveals a clear collaboration between these two inference tasks. We validated our proposed method on five data sets. The first three video data sets, i.e., the SegTrack data set, the YouTube-objects data set, and the Davis data set, are not noisy: all video frames contain the objects. The two noisy data sets, i.e., the XJTU-Stevens data set and the Noisy-ViDiSeg data set, newly introduced in this paper, both have many frames that do not contain the objects. Compared with the state of the art, our method produces inferior results on video data sets without noisy frames but better results on video data sets with noisy frames.

  • Research Article
  • 10.1016/j.asoc.2025.112837
Vanishing mask refinement in semi-supervised video object segmentation
  • Mar 1, 2025
  • Applied Soft Computing
  • Javier Pita + 4 more

This paper presents Video Object Segmentation Enhanced with Segment Anything Model (VOS-E-SAM), a multi-stage architecture for Semi-supervised Video Object Segmentation (SVOS) using the foundational Segment Anything Model (SAM) architecture, aimed at addressing the challenges of mask degradation over time in long video sequences. Our architectural approach enhances the object masks produced by the XMem model by incorporating SAM. This integration uses various input combinations and low-level computer vision techniques to generate point prompts, in order to improve mask continuity and accuracy throughout the entire video cycle. The main challenge addressed is the fading or vanishing of object masks in long video sequences due to problems such as changes in object appearance, occlusions, camera movements, and approach changes. Both the baseline architecture and the newer high-quality version are tested. Through rigorous experimentation with different prompt configurations, we identified an outstanding configuration of SAM inputs to improve mask refinement. Evaluations on benchmark long video datasets, such as LongDataset and LVOS, show that our approach significantly improves mask quality in single-object extended video sequences, as shown by percentage gains in metrics based on the Jaccard index (J) and contour accuracy (F) (mean, recall and decay). Our results show remarkable improvements in mask persistence and accuracy, which sets a new standard for the integration of foundational models in video segmentation and lays the foundation for future research in this field. GitHub: VOS-E-SAM
  • We tackle semi-supervised video object segmentation with foundation models.
  • Our method reduces segmentation vanishing in semi-supervised segmentation.
  • We propose 20+ input configurations to refine masks using SAM and HQ-SAM.
  • Our approach outperforms the baseline in long videos, advancing the state-of-the-art.
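For readers unfamiliar with the J metric mentioned in this abstract, the following minimal sketch computes the Jaccard index (intersection over union) per frame, then its mean and recall over a sequence. It is a generic illustration, not code from the paper: the function names are hypothetical, the 0.5 recall threshold and the choice to score two empty masks as 1.0 are assumptions stated in the comments, and the decay statistic is omitted.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks (lists of rows)."""
    intersection = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            intersection += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def j_mean(pred_seq, gt_seq):
    """Mean of the per-frame Jaccard index over a sequence."""
    scores = [jaccard(p, g) for p, g in zip(pred_seq, gt_seq)]
    return sum(scores) / len(scores)


def j_recall(pred_seq, gt_seq, threshold=0.5):
    """Fraction of frames whose J exceeds `threshold` (0.5 is an assumed default)."""
    scores = [jaccard(p, g) for p, g in zip(pred_seq, gt_seq)]
    return sum(1 for s in scores if s > threshold) / len(scores)


# Two tiny 2x2 frames: the first prediction is exact, the second overlaps in one pixel.
predictions = [[[1, 1], [0, 0]], [[1, 1], [0, 0]]]
ground_truth = [[[1, 1], [0, 0]], [[1, 0], [1, 0]]]

print(f"J mean:   {j_mean(predictions, ground_truth):.3f}")
print(f"J recall: {j_recall(predictions, ground_truth):.3f}")
```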

  • Research Article
  • 10.1109/tcsvt.2024.3451981
MoBox: Enhancing Video Object Segmentation With Motion-Augmented Box Supervision
  • Jan 1, 2025
  • IEEE Transactions on Circuits and Systems for Video Technology
  • Xiaomin Li + 6 more

We propose MoBox, a low-cost solution for semi-supervised video object segmentation that requires only bounding boxes as manual annotations for training. Built upon a mature semi-supervised video object segmentation network, we redesign the training losses and employ a more stringent training strategy. Specifically, we introduce a well-designed constraint term that enhances traditional spatial projection by simultaneously leveraging the projections of both the ground-truth box and the predicted mask across two axes, rather than evaluating discrepancies along the x-axis and y-axis independently. To harness the intrinsic properties of videos, considering the underlying correspondence between motion represented by optical flow and the original image, we incorporate motion coherence information into the color consistency loss as supplementary information and propose a motion discrepancy loss to obtain accurate boundaries. Additionally, to mitigate the ambiguity of weak supervision, we further introduce the pseudo strict constraint during training, which significantly improves model performance. Our approach yields competitive scores on popular benchmarks, achieving a J&F score of 78.6 on the DAVIS 2017 validation set and an Overall score of 78.0 on the YouTube-VOS 2018 validation set. These results highlight the efficacy of MoBox, demonstrating that the semi-supervised video object segmentation model can be effectively trained using only motion-augmented box supervision and intrinsic information of videos.

  • Research Article
  • Cite Count: 1
  • 10.1016/j.neunet.2025.107976
Adaptively trigger memory network with temporal consistency for semi-supervised long video object segmentation.
  • Dec 1, 2025
  • Neural Networks: The Official Journal of the International Neural Network Society
  • Fan Zhang + 2 more

  • Book Chapter
  • Cite Count: 1
  • 10.1007/978-3-030-88013-2_27
Dual Attention Based Network with Hierarchical ConvLSTM for Video Object Segmentation
  • Jan 1, 2021
  • Zongji Zhao + 1 more

Semi-supervised video object segmentation is one of the most basic tasks in the field of computer vision, especially in the multi-object case. It aims to segment masks of multiple foreground objects in a given video sequence, with the annotation mask of the first frame as prior knowledge. In this paper, we propose a novel multi-object video segmentation model. We use the U-Net architecture to obtain multi-scale spatial features. In the encoder part, spatial attention and channel attention mechanisms are used together to enhance the spatial features. We use a recurrent ConvLSTM module in the decoder to segment different object instances in one stage and keep the segmented objects consistent over time. In addition, we use three loss functions for joint training to improve training. We test our network on the popular video object segmentation dataset DAVIS2017. The experimental results demonstrate that our model achieves state-of-the-art performance. Moreover, our model achieves faster inference runtimes than other methods.

  • Conference Article
  • Cite Count: 97
  • 10.1109/iccv48922.2021.00953
Joint Inductive and Transductive Learning for Video Object Segmentation
  • Oct 1, 2021
  • Yunyao Mao + 3 more

Semi-supervised video object segmentation is a task of segmenting the target object in a video sequence given only a mask annotation in the first frame. The limited information available makes it an extremely challenging task. Most previous best-performing methods adopt matching-based transductive reasoning or online inductive learning. Nevertheless, they are either less discriminative for similar instances or insufficient in the utilization of spatio-temporal information. In this work, we propose to integrate transductive and inductive learning into a unified framework to exploit the complementarity between them for accurate and robust video object segmentation. The proposed approach consists of two functional branches. The transduction branch adopts a lightweight transformer architecture to aggregate rich spatio-temporal cues while the induction branch performs online inductive learning to obtain discriminative target information. To bridge these two diverse branches, a two-head label encoder is introduced to learn the suitable target prior for each of them. The generated mask encodings are further forced to be disentangled to better retain their complementarity. Extensive experiments on several prevalent benchmarks show that, without the need of synthetic training data, the proposed approach sets a series of new state-of-the-art records. Code is available at https://github.com/maoyunyao/JOINT.
