DIPNet: Dynamic Identity Propagation Network for Video Object Segmentation

TL;DR

DIPNet introduces a dynamic identity propagation approach for semi-supervised video object segmentation, combining adaptive temporal propagation with lightweight fine-tuning, achieving state-of-the-art results and improved robustness to object variations over four benchmarks while maintaining high efficiency.

Abstract

Many recent methods for semi-supervised Video Object Segmentation (VOS) have achieved good performance by exploiting the annotated first frame via one-shot fine-tuning or mask propagation. However, heavily relying on the first frame may weaken the robustness for VOS, since video objects can show large variations through time. In this work, we propose a Dynamic Identity Propagation Network (DIPNet) that adaptively propagates and accurately segments the video objects over time. To achieve this, DIPNet factors the VOS task at each time step into a dynamic propagation phase and a spatial segmentation phase. The former utilizes a novel identity representation to adaptively propagate objects’ reference information over time, which enhances the robustness to videos’ temporal variations. The segmentation phase uses the propagated information to tackle the object segmentation as an easier static image problem that can be optimized via light-weight fine-tuning on the first frame, thus reducing the computational cost. As a result, by optimizing these two components to complement each other, we can achieve a robust system for VOS. Evaluations on four benchmark datasets show that DIPNet provides state-of-the-art performance with time efficiency.
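The two-phase factorization described above can be illustrated with a toy control loop. This is only a sketch of the propagate-then-segment structure, not the authors' network. The scalar brightness "identity", the blending rate `rate`, the tolerance `tol`, and the function names `masked_mean`, `propagate`, `segment`, and `run_vos` are all assumptions made for illustration.

```python
from typing import List

Frame = List[float]   # toy frame: one brightness value per pixel
Mask = List[bool]     # True where the pixel belongs to the object


def masked_mean(frame: Frame, mask: Mask, fallback: float) -> float:
    """Mean brightness of the masked pixels, or `fallback` if the mask is empty."""
    values = [p for p, m in zip(frame, mask) if m]
    return sum(values) / len(values) if values else fallback


def propagate(identity: float, prev_frame: Frame, prev_mask: Mask,
              rate: float = 0.5) -> float:
    """Propagation phase (toy): blend the reference with the latest appearance."""
    latest = masked_mean(prev_frame, prev_mask, fallback=identity)
    return (1.0 - rate) * identity + rate * latest


def segment(frame: Frame, identity: float, tol: float = 0.2) -> Mask:
    """Segmentation phase (toy): a static per-frame decision given the reference."""
    return [abs(p - identity) < tol for p in frame]


def run_vos(frames: List[Frame], first_mask: Mask) -> List[Mask]:
    """Alternate propagation and segmentation over the video."""
    # The annotated first frame provides the initial reference.
    identity = masked_mean(frames[0], first_mask, fallback=0.0)
    masks = [first_mask]
    for t in range(1, len(frames)):
        # Update the reference from the previous frame and its predicted mask.
        identity = propagate(identity, frames[t - 1], masks[t - 1])
        # Segment the current frame as a static image problem.
        masks.append(segment(frames[t], identity))
    return masks


# An object that moves right and gradually darkens (0.9 -> 0.8 -> 0.7).
frames = [[0.9, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.7]]
masks = run_vos(frames, first_mask=[True, False, False])
```

In this toy run the reference follows the object as it darkens, so each frame's mask marks the object's new position. A real system would replace the scalar reference with a learned identity representation and the threshold with a segmentation network.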

Similar Papers
  • Research Article
  • Cited by 20
  • 10.1109/tip.2018.2859622
Joint Video Object Discovery and Segmentation by Coupled Dynamic Markov Networks.
  • Jul 30, 2018
  • IEEE Transactions on Image Processing
  • Ziyi Liu + 6 more

Extracting the segmentation mask of a target from a single noisy video is challenging because it couples object discovery with segmentation. To address this, we present a method to jointly discover and segment an object from a noisy video, where the target disappears intermittently throughout the video. Previous methods either perform only video object discovery, or perform video object segmentation while presuming that the object exists in every frame. We argue that jointly conducting the two tasks in a unified way will be beneficial. In other words, video object discovery and video object segmentation tasks can facilitate each other. To validate this hypothesis, we propose a principled probabilistic model in which two dynamic Markov networks are coupled: one for discovery and the other for segmentation. When conducting the Bayesian inference on this model using belief propagation, the bi-directional message passing reveals a clear collaboration between these two inference tasks. We validated our proposed method on five data sets. The first three video data sets, i.e., the SegTrack data set, the YouTube-objects data set, and the Davis data set, are not noisy: all video frames contain the objects. The two noisy data sets, i.e., the XJTU-Stevens data set and the Noisy-ViDiSeg data set, newly introduced in this paper, both have many frames that do not contain the objects. Compared with the state of the art, our method produces inferior results on video data sets without noisy frames but obtains better results on video data sets with noisy frames.

  • Conference Article
  • Cited by 1
  • 10.1109/wcica.2008.4594556
Video Object Segmentation Based on Multi-Feature Clustering
  • Jun 1, 2008
  • Shuangyan Hu + 3 more

As a prerequisite for emerging content-based multimedia technologies, video object segmentation is of great importance. This paper proposes a method of video object segmentation based on multi-feature clustering. First, the twice-difference image is obtained from three successive video frames. Then, background noise is eliminated by estimating the feature parameter, and the video object's motion area is extracted. Next, an improved FCM clustering method segments the motion area, and the video object mask is obtained by applying morphological processing to the result. Finally, the desired video object is obtained. Experimental results show that the proposed method performs excellently for video object segmentation and outperforms the method from the literature in spatial accuracy.

  • Research Article
  • Cited by 6
  • 10.1016/j.image.2020.115858
Video object tracking and segmentation with box annotation
  • Apr 20, 2020
  • Signal Processing: Image Communication
  • Ye Wang + 6 more


  • Research Article
  • Cited by 4
  • 10.1109/tpami.2025.3600507
MeViS: A Multi-Modal Dataset for Referring Motion Expression Video Segmentation.
  • Dec 1, 2025
  • IEEE transactions on pattern analysis and machine intelligence
  • Henghui Ding + 6 more

This paper proposes a large-scale multi-modal dataset for referring motion expression video segmentation, focusing on segmenting and tracking target objects in videos based on language description of objects' motions. Existing referring video segmentation datasets often focus on salient objects and use language expressions rich in static attributes, potentially allowing the target object to be identified in a single frame. Such datasets underemphasize the role of motion in both videos and languages. To explore the feasibility of using motion expressions and motion reasoning clues for pixel-level video understanding, we introduce MeViS, a dataset containing 33,072 human-annotated motion expressions in both text and audio, covering 8,171 objects in 2,006 videos of complex scenarios. We benchmark 15 existing methods across 4 tasks supported by MeViS, including 6 referring video object segmentation (RVOS) methods, 3 audio-guided video object segmentation (AVOS) methods, 2 referring multi-object tracking (RMOT) methods, and 4 video captioning methods for the newly introduced referring motion expression generation (RMEG) task. The results demonstrate weaknesses and limitations of existing methods in addressing motion expression-guided video understanding. We further analyze the challenges and propose an approach LMPM++ for RVOS/AVOS/RMOT that achieves new state-of-the-art results. Our dataset provides a platform that facilitates the development of motion expression-guided video understanding algorithms in complex video scenes.

  • Conference Article
  • 10.1109/his.2009.53
Video Object Segmentation by Integrating Motion Information and Gradient Compensation without Background Construction
  • Jan 1, 2009
  • Wu-Chih Hu + 3 more

This paper proposes an effective method for video object segmentation without background construction. In the proposed method, the coarse foreground extraction and fine foreground extraction are obtained using the motion information, edge information, and gradient-variation information which are first evaluated by two successive frames. Next, the video object is extracted using the horizontal/vertical filling scheme based on the coarse foreground extraction and fine foreground extraction. Finally, video object refining is used to obtain the more accurate video object. Experimental results show that the proposed method has good performance in sensitivity, specificity, and spatial accuracy.

  • Conference Article
  • 10.1117/12.538708
An efficient automatic video segmentation method based on intersection of frame differences
  • Sep 29, 2003
  • Proceedings of SPIE, the International Society for Optical Engineering/Proceedings of SPIE
  • Xin Zhang + 3 more

This paper presents a novel automatic Video Object (VO) segmentation method based on the intersection of frame differences. A horizontal scan is used to acquire a coarse VO mask, and edge detection is performed on VO boundaries to remove the uncovered background contained in the intersection. The morphological operator open is then applied to smooth the VO contours after extraction. Experimental results show that the method is accurate and especially efficient, and that it can meet the real-time requirements of applications such as stationary-camera video surveillance.
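The core idea of intersecting frame differences can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: it omits the horizontal scan, edge detection, and morphological opening steps. The helper names `frame_difference` and `intersect_differences` and the threshold value are hypothetical.

```python
from typing import List

Gray = List[List[int]]      # grayscale frame: rows of 0..255 intensities
Binary = List[List[bool]]   # True where change was detected


def frame_difference(a: Gray, b: Gray, threshold: int) -> Binary:
    """Mark pixels whose absolute intensity change exceeds the threshold."""
    return [[abs(pa - pb) > threshold for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def intersect_differences(prev: Gray, cur: Gray, nxt: Gray,
                          threshold: int = 30) -> Binary:
    """Keep only pixels that changed in both (prev, cur) and (cur, nxt)."""
    d1 = frame_difference(prev, cur, threshold)
    d2 = frame_difference(cur, nxt, threshold)
    return [[x and y for x, y in zip(r1, r2)] for r1, r2 in zip(d1, d2)]


# A bright object (200) moving right over a dark background (10).
prev = [[200, 10, 10, 10, 10]]
cur = [[10, 200, 10, 10, 10]]
nxt = [[10, 10, 200, 10, 10]]
mask = intersect_differences(prev, cur, nxt)
```

Each single difference also marks the background that the object uncovered or is about to cover. The intersection keeps only the object's position in the middle frame, which is why the method needs three consecutive frames.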

  • Book Chapter
  • Cited by 3
  • 10.4018/978-1-59904-845-1.ch106
Video Object Segmentation
  • Jan 1, 2009
  • Ee Ping Ong + 1 more

Video object segmentation aims to extract different video objects from a video (i.e., a sequence of consecutive images). It has attracted considerable interest and substantial research effort over the past decade because it is a prerequisite for visual content retrieval (e.g., MPEG-7 related schemes), object-based compression and coding (e.g., MPEG-4 codecs), object recognition, object tracking, security video surveillance, traffic monitoring for law enforcement, and many other applications. Video object segmentation is a nonstandardized but indispensable component of an MPEG-4/7 scheme if a complete solution is to be developed. In fact, to use MPEG-4 object-based video coding, video object segmentation must first be carried out to extract the required video object masks. Video object segmentation is even more important in military applications such as real-time remote identification and tracking of missiles, vehicles, and soldiers. Other possible applications include home, office, and warehouse security, where the tasks of interest are monitoring and recording intruders or foreign objects, alerting the personnel concerned, and transmitting the segmented foreground objects over a bandwidth-hungry channel while intruders are present. A fully automatic video object segmentation tool is therefore very useful, with wide practical applications in everyday life, where it can contribute to improved efficiency and to savings in time, manpower, and cost.

  • Conference Article
  • Cited by 8
  • 10.1109/mmsp.1999.793844
An accurate region based object tracking for video sequences
  • Jan 1, 1999
  • Dongxiang Xu + 2 more

With the popularity of the MPEG-4 and MPEG-7 standards, video object (VO) segmentation has become a very challenging research area in video applications. We present a novel method for semi-automatic object segmentation in a video sequence. The proposed approach starts with a rough, user-supplied VO definition. It then combines each frame's region segmentation and motion estimation results to construct the objects of interest and track them over time. An algorithm based on an active contour model is employed to further fine-tune the object's contour and extract an accurate object boundary. Some experimental results and future research directions are also discussed.

  • Research Article
  • Cited by 16
  • 10.1109/tnnls.2021.3054769
Directional Deep Embedding and Appearance Learning for Fast Video Object Segmentation.
  • Aug 1, 2022
  • IEEE Transactions on Neural Networks and Learning Systems
  • Yingjie Yin + 3 more

Most recent semisupervised video object segmentation (VOS) methods rely on fine-tuning deep convolutional neural networks online using the given mask of the first frame or predicted masks of subsequent frames. However, the online fine-tuning process is usually time-consuming, limiting the practical use of such methods. We propose a directional deep embedding and appearance learning (DDEAL) method, which is free of the online fine-tuning process, for fast VOS. First, a global directional matching module (GDMM), which can be efficiently implemented by parallel convolutional operations, is proposed to learn a semantic pixel-wise embedding as an internal guidance. Second, an effective directional appearance model-based statistics is proposed to represent the target and background on a spherical embedding space for VOS. Equipped with the GDMM and the directional appearance model learning module, DDEAL learns static cues from the labeled first frame and dynamically updates cues of the subsequent frames for object segmentation. Our method exhibits the state-of-the-art VOS performance without using online fine-tuning. Specifically, it achieves a J & F mean score of 74.8% on DAVIS 2017 data set and an overall score G of 71.3% on the large-scale YouTube-VOS data set, while retaining a speed of 25 fps with a single NVIDIA TITAN Xp GPU. Furthermore, our faster version runs 31 fps with only a little accuracy loss.

  • Conference Article
  • Cited by 182
  • 10.1109/cvpr42600.2020.00898
Learning Video Object Segmentation From Unlabeled Videos
  • Jun 1, 2020
  • Xiankai Lu + 5 more

We propose a new method for video object segmentation (VOS) that addresses object pattern learning from unlabeled videos, unlike most existing methods which rely heavily on extensive annotated data. We introduce a unified unsupervised/weakly supervised learning framework, called MuG, that comprehensively captures intrinsic properties of VOS at multiple granularities. Our approach can help advance understanding of visual patterns in VOS and significantly reduce annotation burden. With a carefully-designed architecture and strong representation learning ability, our learned model can be applied to diverse VOS settings, including object-level zero-shot VOS, instance-level zero-shot VOS, and one-shot VOS. Experiments demonstrate promising performance in these settings, as well as the potential of MuG in leveraging unlabeled data to further improve the segmentation accuracy.

  • Research Article
  • 10.3390/s24196405
Click to Correction: Interactive Bidirectional Dynamic Propagation Video Object Segmentation Network.
  • Oct 2, 2024
  • Sensors (Basel, Switzerland)
  • Shuting Yang + 2 more

High-quality video object segmentation is a challenging visual computing task. Interactive segmentation can improve segmentation results. This paper proposes a multi-round interactive dynamic propagation instance-level video object segmentation network based on click interaction. The network consists of two parts: a user interaction segmentation module and a bidirectional dynamic propagation module. A prior segmentation network was designed in the user interaction segmentation module to better segment objects of different scales that users click on. The dynamic propagation network achieves high-precision video object segmentation through the bidirectional propagation and fusion of segmentation masks obtained from multiple rounds of interaction. Experiments on interactive segmentation datasets and video object segmentation datasets show that our method achieves state-of-the-art segmentation results with fewer click interactions.

  • Book Chapter
  • Cited by 4
  • 10.1007/978-3-030-42128-1_5
Unsupervised Learning of Object Segmentation in Video with Highly Probable Positive Features
  • Jan 1, 2020
  • Marius Leordeanu

Often, when learning without human supervision, it is possible to tell whether a certain cue or data sample is likely to belong to the positive class of interest. In this chapter, we study this case and show that such highly probable positive features could be reliably used for learning in the real natural world, without human supervision. We chose as use case the problem of foreground object segmentation, since it is one of the fundamental ones in vision. The main task, in this case, is to separate automatically the main object of interest present in a video sequence from its surrounding background. An efficient solution to this task would have an immense practical value. It would enable large-scale video interpretation at a high semantic level in the absence of the costly manual labeling. In this chapter, we present several unsupervised algorithms for generating foreground object soft masks based on automatic selection and learning from highly probable positive features. We start with a very simple and fast, yet surprisingly effective method that is able to produce robust object segmentations by using only simple colors as features. While being very simple to implement and understand, the algorithm constitutes the basis for a more general principle for learning from highly probable positive features, which we study theoretically and develop further within a more complex method for unsupervised video object segmentation. One important module in this algorithm connects to the feature selection by clustering method presented in Chap. 4; that approach is used in this case for learning an effective and robust patch-based descriptor based on color co-occurrences. We also introduce a novel and fast algorithm for background subtraction, called VideoPCA, based on modeling the background scene with a linear subspace and regarding the main foreground objects as regions that do not belong to that subspace.
All algorithms and ideas presented are, at the core, connected by a single fundamental idea—that of learning from highly probable positive features, which are easy to detect in an unsupervised way with high precision and are effective, together, in learning powerful classifiers. The idea naturally starts and evolves from the insights and conclusions of the previous chapters presented in the book. In this chapter, we show that such HPP features can be selected efficiently by taking into consideration the spatiotemporal appearance and motion consistency of the object in the video sequence. We also emphasize the role of the contrasting properties between the foreground object and its background. Our final foreground segmentation model is created over several stages: we start from pixel-level analysis and move to descriptors that consider information over groups of pixels combined with efficient motion analysis. We also prove theoretical properties of our unsupervised learning method, which under some mild constraints is guaranteed to learn the correct classifier even in the unsupervised case. We achieve competitive and even state-of-the-art results on the challenging YouTube-Objects and SegTrack datasets, while being at least one order of magnitude faster than the competition. The strong performance of our method, along with its theoretical properties, constitutes another step towards solving unsupervised discovery in video.

  • Research Article
  • Cited by 53
  • 10.1109/tpami.2018.2890659
Online Meta Adaptation for Fast Video Object Segmentation
  • Jan 1, 2019
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
  • Huaxin Xiao + 4 more

Conventional video object segmentation (VOS) methods based on deep neural networks are dominated by heavy fine-tuning of a segmentation model on the first frame of a given video, which is time-consuming and inefficient. In this paper, we propose a novel method which rapidly adapts a base segmentation model to new video sequences with only a couple of model-update iterations, without sacrificing performance. This efficiency comes from the meta-learning paradigm, which yields a meta-segmentation model, and from a novel continuous learning approach, which enables online adaptation of the segmentation model. Concretely, we train a meta-learner on multiple VOS tasks so that the meta model captures their common knowledge and gains the ability to quickly adapt the segmentation model to new video sequences. Furthermore, to deal with the unique challenges that temporal variations in the video pose for VOS, e.g., object motion and appearance changes, we propose a principled online adaptation approach that continuously adapts the segmentation model across video frames by exploiting temporal context effectively, providing robustness to such temporal variations. Integrating the meta-learner with the online adaptation approach, the proposed VOS model achieves competitive performance against the state of the art and also provides faster per-frame processing speed.

  • Research Article
  • Cited by 2
  • 10.1145/3617067
Attention-guided Adversarial Attack for Video Object Segmentation
  • Nov 14, 2023
  • ACM Transactions on Intelligent Systems and Technology
  • Rui Yao + 6 more

Video Object Segmentation (VOS) methods have made many breakthroughs with the help of the continuous development and advancement of deep learning. However, deep learning models are vulnerable to malicious adversarial attacks, which mislead the model into making wrong decisions by adding adversarial perturbations that humans cannot perceive to the input image. Threats to deep learning models remind us that video object segmentation methods are also vulnerable to attacks, thereby threatening their security. Therefore, we study adversarial attacks on the VOS task to better identify the vulnerabilities of VOS methods, which in turn provides an opportunity to improve their robustness. In this paper, we propose an attention-guided adversarial attack method, which uses spatial attention blocks to capture features with global dependencies to construct correlations between consecutive video frames, and performs multipath aggregation to effectively integrate spatial-temporal perturbation, thereby guiding the deconvolution network to generate adversarial examples with strong attack capability. Specifically, the class loss function is designed to enable the deconvolution network to better activate noise in other regions and suppress the activation related to the object class based on the enhanced feature map of the object class. At the same time, an attentional feature loss is designed to enhance the transferability of the attack. The experimental results on the DAVIS dataset show that the proposed attention-guided adversarial attack method can significantly reduce the segmentation accuracy of OSVOS, and the drop rate of the J & F mean on DAVIS 2016 can reach 73.6%. The generated adversarial examples are also highly transferable to other video object segmentation models.

  • Conference Article
  • Cited by 1
  • 10.1109/icosst48232.2019.9043975
Object Segmentation in Video Sequences by using Single Frame Processing
  • Dec 1, 2019
  • Muhammad Hamza Bhatti + 2 more

Object segmentation, detection, and tracking in videos is one of the most important tasks in computer vision. It is necessary in all deployed real-time surveillance systems. Various unsupervised and semi-supervised video object segmentation techniques have been implemented and have shown efficient results. However, all of these techniques process every frame of a video sequence, which requires a large amount of training data and results in a long computation time. In this paper, a semi-supervised technique is proposed that segments an object in a video by processing just a single frame of the sequence. In this framework, a fully convolutional network is used to separate the foreground from the image, create the mask of the object, and then segment the object with the help of this mask. The foreground separation in a frame is done using a pre-trained network, while the rest of the network is trained and tested on the DAVIS dataset. The results show that the proposed framework takes less computational time and improves the overall accuracy of video object segmentation by 10% compared to previous techniques.
