Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching

Abstract

Significant progress has been made in Video Object Segmentation (VOS), the video object tracking task at its finest level. While the VOS task can be naturally decoupled into image semantic segmentation and video object tracking, significantly more research effort has gone into segmentation than into tracking. In this paper, we introduce "tracking-by-detection" into VOS, which coherently integrates segmentation into tracking, by proposing a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance. Notably, our method is entirely online and thus suitable for one-shot learning, and our end-to-end trainable model allows multiple object segmentation in one forward pass. We achieve new state-of-the-art performance on the DAVIS benchmark in both speed and accuracy without complicated bells and whistles, with a speed of 0.14 seconds per frame and a J&F measure of 75.9%.
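The two headline numbers in the abstract can be made concrete with a small sketch. The code below is an illustration only, not the authors' evaluation code: it computes the region similarity J as the intersection over union of two binary masks, treats the boundary measure F as a given input (it is not computed here), and averages the two to obtain J&F, which is how the DAVIS benchmark combines them. The function names and the toy masks are assumptions made for this example.

```python
def jaccard(pred, gt):
    """Region similarity J: intersection over union of two binary masks."""
    inter = 0
    union = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_score, f_score):
    """J&F: the mean of region similarity J and boundary accuracy F."""
    return (j_score + f_score) / 2.0


def frames_per_second(seconds_per_frame):
    """Convert a per-frame runtime into throughput."""
    return 1.0 / seconds_per_frame


# Toy 3x3 masks (hypothetical data): 3 pixels overlap, 5 pixels in the union.
pred = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
gt = [[0, 1, 1],
      [0, 1, 0],
      [0, 1, 0]]

j = jaccard(pred, gt)           # 3 / 5 = 0.6
fps = frames_per_second(0.14)   # roughly 7 frames per second
```

Under this conversion, the reported 0.14 seconds per frame corresponds to roughly seven frames per second.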

Similar Papers
  • Research Article
  • Citations: 33
  • 10.1109/tcsvt.2013.2242595
Video Object Segmentation and Tracking Framework With Improved Threshold Decision and Diffusion Distance
  • Jun 1, 2013
  • IEEE Transactions on Circuits and Systems for Video Technology
  • Shao-Yi Chien + 3 more

Video object segmentation and tracking are two essential building blocks of smart surveillance systems. However, there are several issues that need to be resolved. Threshold decision is a difficult problem for video object segmentation with a multi-background model. In addition, some conditions make robust video object tracking difficult. These conditions include nonrigid object motion, target appearance variations due to changes in illumination, and background clutter. In this paper, a video object segmentation and tracking framework is proposed for smart cameras in visual surveillance networks with two major contributions. First, we propose a robust threshold decision algorithm for video object segmentation with a multi-background model. Second, we propose a video object tracking framework based on a particle filter with the likelihood function composed of diffusion distance for measuring color histogram similarity and motion clue from video object segmentation. The proposed framework can track nonrigid moving objects under drastic changes in illumination and background clutter. Experimental results show that the presented algorithms perform well for several challenging sequences, and our proposed methods are effective for the aforementioned issues.
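The likelihood described above can be sketched for one-dimensional histograms. This is a minimal illustration under stated assumptions, not the paper's implementation: the diffusion distance is taken in its usual pyramid form (sum of L1 norms of the repeatedly smoothed and downsampled histogram difference), a small `[0.25, 0.5, 0.25]` kernel stands in for Gaussian smoothing, the exponential likelihood and its `scale` parameter are hypothetical choices, and the motion cue from segmentation is omitted.

```python
import math


def smooth(values, kernel=(0.25, 0.5, 0.25)):
    """Convolve with a small symmetric kernel, zero-padded at the edges."""
    half = len(kernel) // 2
    out = []
    for i in range(len(values)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(values):
                acc += w * values[j]
        out.append(acc)
    return out


def diffusion_distance(h1, h2):
    """Sum of L1 norms of the histogram difference over a smoothing pyramid."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    diff = [a - b for a, b in zip(h1, h2)]
    total = sum(abs(d) for d in diff)
    while len(diff) > 1:
        # Smooth, then keep every second bin (downsample by 2).
        diff = smooth(diff)[::2]
        total += sum(abs(d) for d in diff)
    return total


def particle_weights(h_ref, candidate_histograms, scale=1.0):
    """Hypothetical likelihood exp(-distance / scale), normalised over particles."""
    raw = [math.exp(-diffusion_distance(h_ref, h) / scale)
           for h in candidate_histograms]
    norm = sum(raw)
    return [r / norm for r in raw]


# Toy 8-bin colour histograms (hypothetical data).
reference = [1, 0, 0, 0, 0, 0, 0, 0]
near_shift = [0, 1, 0, 0, 0, 0, 0, 0]
far_shift = [0, 0, 0, 0, 0, 0, 0, 1]
weights = particle_weights(reference, [reference, near_shift, far_shift])
```

Unlike a plain bin-to-bin L1 distance, which scores both shifted histograms identically, the pyramid makes a small colour shift cheaper than a large one, so the particle with the nearer histogram receives the larger weight.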

  • Research Article
  • Citations: 20
  • 10.1109/tip.2018.2859622
Joint Video Object Discovery and Segmentation by Coupled Dynamic Markov Networks
  • Jul 30, 2018
  • IEEE Transactions on Image Processing
  • Ziyi Liu + 6 more

It is a challenging task to extract the segmentation mask of a target from a single noisy video, which involves object discovery coupled with segmentation. To solve this challenge, we present a method to jointly discover and segment an object from a noisy video, where the target disappears intermittently throughout the video. Previous methods either perform only video object discovery, or perform video object segmentation while presuming that the object exists in each frame. We argue that jointly conducting the two tasks in a unified way will be beneficial. In other words, video object discovery and video object segmentation tasks can facilitate each other. To validate this hypothesis, we propose a principled probabilistic model in which two dynamic Markov networks are coupled: one for discovery and the other for segmentation. When conducting Bayesian inference on this model using belief propagation, the bi-directional message passing reveals a clear collaboration between these two inference tasks. We validated our proposed method on five data sets. The first three video data sets, i.e., the SegTrack data set, the YouTube-objects data set, and the Davis data set, are not noisy: all video frames contain the objects. The two noisy data sets, i.e., the XJTU-Stevens data set and the Noisy-ViDiSeg data set, newly introduced in this paper, both have many frames that do not contain the objects. Compared with the state of the art, our method produces inferior results on video data sets without noisy frames but obtains better results on video data sets with noisy frames.

  • Research Article
  • Citations: 6
  • 10.1016/j.image.2020.115858
Video object tracking and segmentation with box annotation
  • Apr 20, 2020
  • Signal Processing: Image Communication
  • Ye Wang + 6 more

  • Research Article
  • 10.1049/el.2019.0992
Divided attention
  • Apr 1, 2019
  • Electronics Letters
  • Anonymous

Researchers from Nanjing University of Information Science and Technology (NUIST) present an attention-modulating network for video object segmentation, with an advanced attention modulator that efficiently steers a segmentation model to focus on a specific object of interest. The group employ a focal loss that distinguishes simple samples from more difficult ones, which accelerates the convergence of network training and helps achieve state-of-the-art segmentation performance. Video object segmentation (VOS) is a fundamental task in computer vision, with important applications in video editing, robotics, and self-driving cars. VOS tasks are mainly categorised as unsupervised or semi-supervised. The former seeks to find and segment the salient targets in a video entirely without supervision, with the algorithm itself deciding what the main segmentation target is. The latter aims at segmenting an object instance throughout the entire video sequence given only the object mask in the first frame. This can be viewed as a pixel-level object tracking problem. Semi-supervised VOS can be subdivided into single-object segmentation and multi-object segmentation. In their Letter, the team focus on semi-supervised VOS. Deep learning for VOS has gained attention in the research community in recent years. Existing semi-supervised VOS techniques work by constructing deep networks and fine-tuning a pre-trained classifier on the ground truth given in the first frame during online testing. This online fine-tuning of a classifier during testing has been shown to significantly improve accuracy.
[Figure: Illustrative diagram of the proposed segmentation model and approach.]
[Figure: Segmentation results.]
The team construct an attention-modulating network for the semi-supervised VOS task.
Co-author Kaihua Zhang elaborates on the process: “We designed an efficient visual and spatial attention modulator based on the semantic information of the annotated object in the first frame and the spatial information of the predicted object mask in the previous frame, respectively, to quickly modulate the segmentation model to focus on the specific object of interest. Then we design a SCAM architecture, which includes a channel attention module and a spatial attention module, and inject it into the segmentation model to further refine its feature maps. In addition, we construct a feature pyramid attention module to mine context information at different scales to solve the problem of multi-scale segmentation.” Most existing methods rely on fine-tuning models using first-frame annotations and are time-consuming, making them unsuitable for most practical applications. To address this issue, the proposed approach uses an attention-modulating network to focus on the appearance of a specific object instance in a single feed-forward pass without fine-tuning. Compared with other methods, this method achieves state-of-the-art performance on the DAVIS2017 dataset by using attention modulators, feature pyramid attention modules and focal loss. To overcome the sample imbalance problem, the team drew on focal loss, which accelerates the convergence of network training and helps to distinguish between difficult and simple samples. VOS remains challenging due to occlusions, fast motion, deformation, and significant appearance variations over time. The method uses a visual attention modulator to extract semantic information such as category, color and shape from the first frame. The spatial attention modulator takes the predicted location of the object mask in the previous frame as a spatial prior to guide the segmentation network to focus on the regions where the target is most likely to appear in the current frame.
To handle objects at multiple scales, feature pyramid attention modules mine context information at different scales, achieving better pixel-level attention for the high-level feature maps. The proposed VOS approach is fast, which facilitates many applications, such as interactive video editing and augmented reality. It may be applied to video understanding models in the short term and, after long-term development, to robotics and self-driving cars. Kaihua Zhang notes on his group's future work: “Experiments show that our algorithm performs erroneous instance segmentation when similar objects occlude each other. To tackle this problem, we will leverage a position-sensitive embedding which is capable of distinguishing the pixels of similar objects. We have also found that solving VOS with multiple instances requires template matching to deal with occlusion and temporal propagation to ensure temporal continuity; otherwise the segmentation instance would be lost. Thus, we will use the re-identification module to retrieve lost instances and take its frame as the starting point and use the mask propagation module to bi-directionally recover the lost instances.” Over the next decade, VOS is expected to achieve higher precision while meeting real-time application requirements. At present, manual pixel-level annotation of VOS data sets is too expensive, so cheaper large-scale VOS data sets are expected in the future.
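The role of focal loss in separating simple from difficult samples can be sketched as follows. This is a minimal per-pixel illustration of the standard binary focal loss, not the team's training code; the `gamma` and `alpha` defaults are illustrative assumptions rather than values reported in the Letter.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single prediction.

    p: predicted foreground probability in [0, 1]; y: label, 1 or 0.
    gamma and alpha are illustrative defaults, not values from the Letter.
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    p_t = min(max(p_t, 1e-12), 1.0)             # avoid log(0)
    # The (1 - p_t) ** gamma factor shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(p, y):
    """Plain binary cross-entropy, for comparison."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(min(max(p_t, 1e-12), 1.0))


easy = focal_loss(0.9, 1)   # confidently correct foreground pixel
hard = focal_loss(0.1, 1)   # badly misclassified foreground pixel
easy_ratio = easy / cross_entropy(0.9, 1)   # share of cross-entropy kept
hard_ratio = hard / cross_entropy(0.1, 1)
```

With these settings the easy sample keeps only a small fraction of its cross-entropy loss while the hard sample keeps far more, so the gradient is dominated by the difficult pixels.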

  • Research Article
  • Citations: 40
  • 10.1109/tcsvt.2004.828347
Robust Segmentation and Tracking of Colored Objects in Video
  • Jun 1, 2004
  • IEEE Transactions on Circuits and Systems for Video Technology
  • T Gevers

Segmenting and tracking of objects in video is of great importance for video-based encoding, surveillance, and retrieval. However, the inherent difficulty of object segmentation and tracking is to distinguish changes in the displacement of objects from disturbing effects such as noise and illumination changes. Therefore, in this paper, we formulate a color-based deformable model which is robust against noisy data and changing illumination. Computational methods are presented to measure color constant gradients. Further, a model is given to estimate the amount of sensor noise through these color constant gradients. The obtained uncertainty is subsequently used as a weighting term in the deformation process. Experiments are conducted on image sequences recorded from three-dimensional scenes. From the experimental results, it is shown that the proposed color constant deformable method successfully finds object contours robust against illumination, and noisy, but homogeneous regions.

  • Conference Article
  • Citations: 7
  • 10.1109/icme.2000.871574
Segmentation and tracking of video objects for a content-based video indexing context
  • Apr 28, 2017
  • M Maziere + 3 more

This paper examines the problem of segmentation and tracking of video objects for content-based information retrieval. Segmentation and tracking of video objects plays an important role in index creation and user request definition steps. The object is initially selected using a semi-automatic approach. For this purpose, a user-based selection is required to define roughly the object to be tracked. In this paper, we propose two different methods to allow an accurate contour definition from the user selection. The first one is based on an active contour model which progressively refines the selection by fitting the natural edges of the object while the second used a binary partition tree with a marker and propagation approach. The video object is thus tracked by using a hybrid structure alternately combining a hierarchical mesh for the motion estimation between two frames and a multi-resolution active contour model. This contour model is derived directly from the mesh boundaries in order to reposition the snake's nodes onto the natural edges of the object. The object-based segmentation associated with object tracking allows relevant descriptors to be built for a future matching purpose.

  • Conference Article
  • Citations: 1
  • 10.1109/icosst48232.2019.9043975
Object Segmentation in Video Sequences by using Single Frame Processing
  • Dec 1, 2019
  • Muhammad Hamza Bhatti + 2 more

Object segmentation, detection and tracking in videos is one of the most important tasks in computer vision. It is necessary in all deployed real-time surveillance systems. Various unsupervised and semi-supervised video object segmentation techniques have been implemented and have shown good results. However, all of these techniques process every frame of a video sequence, which requires a large amount of training data and results in long computation times. In this paper, a semi-supervised technique is proposed which segments an object in a video by processing just a single frame of the sequence. In this framework, a fully convolutional network is used to separate the foreground from the image, create the mask of the object and then segment the object with the help of this mask. The foreground separation in a frame is done using a pre-trained network, while training and testing of the rest of the network is done using the DAVIS dataset. The results show that the proposed framework takes less computational time and improves the overall accuracy of video object segmentation by 10% compared to previous techniques.

  • Book Chapter
  • 10.1007/978-3-030-34120-6_31
Enhanced Video Segmentation with Object Tracking
  • Jan 1, 2019
  • Zheran Hong + 5 more

The high efficiency and superior performance of the fully convolutional network (FCN) architecture have made employing FCNs in the video object segmentation task a recent trend. However, these FCN-based methods usually ignore the motion information between frames, which may lead to similar object inference or background clutter issues. To deal with these, we propose to use tracking techniques to improve the performance of video object segmentation. The proposed algorithm performs video object segmentation and tracking simultaneously in a unified framework. The motion information provided by the initial tracking result is then used to reject outliers in the segmentation mask caused by background complexities, such as similar object inference or background clutter issues. In return, the final segmentation result can be used to supervise the tracking result. In this iterative way, the performance of both tasks is enhanced. Experimental results on the challenging benchmark demonstrate the effectiveness of our proposed method.

  • Conference Article
  • Citations: 34
  • 10.1109/wacv56688.2023.00172
BURST: A Benchmark for Unifying Object Recognition, Segmentation and Tracking in Video
  • Jan 1, 2023
  • Ali Athar + 6 more

Multiple existing benchmarks involve tracking and segmenting objects in video, e.g., Video Object Segmentation (VOS) and Multi-Object Tracking and Segmentation (MOTS), but there is little interaction between them due to the use of disparate benchmark datasets and metrics (e.g. $\mathcal{J}\&\mathcal{F}$, mAP, sMOTSA). As a result, published works usually target a particular benchmark, and are not easily comparable to one another. We believe that the development of generalized methods that can tackle multiple tasks requires greater cohesion among these research sub-communities. In this paper, we aim to facilitate this by proposing BURST, a dataset which contains thousands of diverse videos with high-quality object masks, and an associated benchmark with six tasks involving object tracking and segmentation in video. All tasks are evaluated using the same data and comparable metrics, which enables researchers to consider them in unison, and hence, more effectively pool knowledge from different methods across different tasks. Additionally, we demonstrate several baselines for all tasks and show that approaches for one task can be applied to another with a quantifiable and explainable performance difference. Dataset annotations are available at: https://github.com/Ali2500/BURST-benchmark.

  • Conference Article
  • Citations: 4
  • 10.1109/icip.2000.899356
Segmentation and tracking of video objects: suited to content-based video indexing, interactive television and production systems
  • Jan 1, 2000
  • M Maziere + 1 more

This paper examines the problem of segmentation and tracking of video objects for a content-based information retrieval context. Our method starts first with an interactive video object selection, then alternately tracks and fits the object of interest as long as possible. A user-based selection is required in order to initialize the process, whereas an active contour model progressively refines the selection by fitting the natural edges of the object. The video object is thus tracked by using a hybrid structure combining a hierarchical mesh for the motion estimation between two frames and a multi-resolution active contour model. This contour model is derived directly from the mesh boundaries in order to reposition the snake's nodes onto the natural edges of the object.

  • Book Chapter
  • Citations: 1
  • 10.1007/978-981-19-1018-0_57
Analysis of Multifeatured Threshold Filtered-Based Real-Time Video Segmentation and Tracking in Video Surveillance
  • Jan 1, 2022
  • T Kusuma + 1 more

Moving object segmentation and detection have become an important topic in computer vision. As such, they are widely used in video surveillance applications such as driving assistance programs, robots, traffic monitoring, and crime pattern identification. In addition, video object tracking is an important function in video surveillance systems because it provides temporary interactive information about moving objects. An important function of video object segmentation is to find and separate important elements in the video frame behind the domain. The purpose of video tracking is to combine targeted objects into consecutive video frames. First of all, enhanced threshold filtered video object detection and tracking (TFVODT) is designed to classify objects according to their size and color, and to achieve better accuracy of video object detection. Initially, the TFVODT framework distinguishes a video object by its characteristics such as size and color. The TFVODT framework performs the function of distinguishing an object through the median filter-based enhanced Laplacian thresholding process. Along with the support of the split object, the TFVODT framework does well to track the video object. Second, threshold filtered video object detection and tracking (ITFVODT) is developed to distinguish the video's elements based on their features such as texture, durability, and performance of video object detection. All video frames found in the ITFVODT framework contain similar features such as quality and contrast.
Keywords: Object tracking, ITFVODT, TFVODT, EMFVD, Segmentation

  • Conference Article
  • 10.1109/ijcnn48605.2020.9207305
Video object segmentation using spatio-temporal deep network
  • Jul 1, 2020
  • Akshaya Ramaswamy + 2 more

Video analysis is increasingly becoming possible with improvements in hardware and deep learning algorithms. Videos contain both the spatial and the temporal information that comes closest to real-world visual information representation. Although the human brain can make better decisions using spatio-temporal data, the images and video frames captured from the same standard RGB camera will vary in quality. Deep learning has resulted in extraordinary performance for image analysis. Image-based deep networks have been modified and extended to work on video, and optical flow between the frames has been utilized to capture temporal variations. There is a gap in understanding whether such networks capture the spatio-temporal information collectively. A network that captures the information effectively should be capable of good performance despite relatively poor-quality video frames. In this work, different deep network architectures are explored and their ability to capture spatio-temporal features is assessed. With an understanding of the advantages and disadvantages of the network components, a new network is designed for the task of video object segmentation (VOS). The performance of the proposed network is evaluated using the DAVIS dataset for three tasks: VOS using weak supervision, zero-shot VOS and one-shot VOS. The best performance is reported in comparison to the state of the art on the DAVIS dataset, and the robustness of the model to noisy labels is demonstrated.

  • Research Article
  • Citations: 140
  • 10.1007/s11263-019-01164-6
Lucid Data Dreaming for Video Object Segmentation
  • Mar 15, 2019
  • International Journal of Computer Vision
  • Anna Khoreva + 4 more

Convolutional networks reach top quality in pixel-level video object segmentation but require a large amount of training data (1k–100k) to deliver such results. We propose a new training strategy which achieves state-of-the-art results across three evaluation datasets while using 20×–1000× less annotated data than competing methods. Our approach is suitable for both single and multiple object segmentation. Instead of using large training sets hoping to generalize across domains, we generate in-domain training data using the provided annotation on the first frame of each video to synthesize, or “lucid dream”, plausible future video frames (in a lucid dream the sleeper is aware that he or she is dreaming and is sometimes able to control the course of the dream). In-domain per-video training data allows us to train high quality appearance- and motion-based models, as well as tune the post-processing stage. This approach allows us to reach competitive results even when training from only a single annotated frame, without ImageNet pre-training. Our results indicate that using a larger training set is not automatically better, and that for the video object segmentation task a smaller training set that is closer to the target domain is more effective. This changes the mindset regarding how many training samples and general “objectness” knowledge are required for the video object segmentation task.

  • Conference Article
  • Citations: 17
  • 10.1145/3394171.3413942
Dual Temporal Memory Network for Efficient Video Object Segmentation
  • Oct 12, 2020
  • Kaihua Zhang + 5 more

Video Object Segmentation (VOS) is typically formulated in a semi-supervised setting. Given the ground-truth segmentation mask on the first frame, the task of VOS is to track and segment the single or multiple objects of interest in the remaining frames of the video at the pixel level. One of the fundamental challenges in VOS is how to make the best use of the temporal information to boost performance. We present an end-to-end network which stores short- and long-term video sequence information preceding the current frame as temporal memories to address temporal modeling in VOS. Our network consists of two temporal sub-networks: a short-term memory sub-network and a long-term memory sub-network. The short-term memory sub-network models the fine-grained spatial-temporal interactions between local regions across neighboring frames in video via a graph-based learning framework, which preserves the visual consistency of local regions over time. The long-term memory sub-network models the long-range evolution of the object via a Simplified-Gated Recurrent Unit (S-GRU), making the segmentation robust against occlusions and drift errors. In our experiments, we show that our proposed method achieves favorable and competitive performance on three frequently used VOS datasets, DAVIS 2016, DAVIS 2017 and YouTube-VOS, in terms of both speed and accuracy.

  • Conference Article
  • Citations: 276
  • 10.1109/cvpr.2019.00542
RVOS: End-To-End Recurrent Network for Video Object Segmentation
  • Jun 1, 2019
  • Carles Ventura + 5 more

Multiple object video object segmentation is a challenging task, especially in the zero-shot case, when no object mask is given at the initial frame and the model has to find the objects to be segmented along the sequence. In our work, we propose a Recurrent network for multiple object Video Object Segmentation (RVOS) that is fully end-to-end trainable. Our model incorporates recurrence in two different domains: (i) the spatial, which allows it to discover the different object instances within a frame, and (ii) the temporal, which allows it to keep the segmented objects coherent over time. We train RVOS for zero-shot video object segmentation and are the first to report quantitative results for the DAVIS-2017 and YouTube-VOS benchmarks. Further, we adapt RVOS for one-shot video object segmentation by using the masks obtained in previous time steps as inputs to be processed by the recurrent module. Our model reaches results comparable to state-of-the-art techniques on the YouTube-VOS benchmark and outperforms all previous video object segmentation methods not using online learning on the DAVIS-2017 benchmark. Moreover, our model achieves faster inference runtimes than previous methods, reaching 44 ms/frame on a P100 GPU.
