Joint Video Object Discovery and Segmentation by Coupled Dynamic Markov Networks.
Extracting the segmentation mask of a target from a single noisy video is a challenging task that couples object discovery with segmentation. To address it, we present a method to jointly discover and segment an object from a noisy video in which the target disappears intermittently. Previous methods either perform only video object discovery, or perform video object segmentation under the assumption that the object is present in every frame. We argue that conducting the two tasks jointly in a unified way is beneficial; in other words, video object discovery and video object segmentation can facilitate each other. To validate this hypothesis, we propose a principled probabilistic model in which two dynamic Markov networks are coupled: one for discovery and the other for segmentation. When Bayesian inference is conducted on this model using belief propagation, the bi-directional message passing reveals a clear collaboration between the two inference tasks. We validated the proposed method on five data sets. The first three video data sets, i.e., the SegTrack data set, the YouTube-objects data set, and the Davis data set, are not noisy: all video frames contain the objects. The two noisy data sets, i.e., the XJTU-Stevens data set and the Noisy-ViDiSeg data set, newly introduced in this paper, both have many frames that do not contain the objects. Compared with the state of the art, our method produces inferior results on video data sets without noisy frames but obtains better results on video data sets with noisy frames.
- Conference Article
- 10.1109/ijcnn48605.2020.9207305
- Jul 1, 2020
Video analysis is becoming increasingly feasible with improvements in hardware and deep learning algorithms. Videos contain both spatial and temporal information and come closest to representing real-world visual information. Although the human brain can make better decisions using spatio-temporal data, the images and video frames captured by the same standard RGB camera vary in quality. Deep learning has delivered extraordinary performance in image analysis. Image-based deep networks have been modified and extended to work on video, and optical flow between frames has been used to capture temporal variations. It remains unclear whether such networks capture spatio-temporal information jointly. A network that captures this information effectively should perform well even on relatively poor-quality video frames. In this work, different deep network architectures are explored and their ability to capture spatio-temporal features is examined. Based on an understanding of the advantages and disadvantages of the network components, a new network is designed for the task of video object segmentation (VOS). The performance of the proposed network is evaluated on the DAVIS dataset for three tasks: VOS using weak supervision, zero-shot VOS, and one-shot VOS. The best performance in comparison to the state of the art is reported on the DAVIS dataset, and the robustness of the model to noisy labels is demonstrated.
- Research Article
- 10.1049/el.2019.0992
- Apr 1, 2019
- Electronics Letters
Researchers from Nanjing University of Information Science and Technology (NUIST) present an attention-modulating network for video object segmentation with an advanced attention modulator that efficiently modulates a segmentation model to focus on a specific object of interest. The group employ a focal loss that distinguishes simple samples from more difficult ones to accelerate the convergence of network training and achieve state-of-the-art segmentation performance. Video object segmentation (VOS) is a fundamental task in computer vision, with important applications in video editing, robotics, and self-driving cars. VOS tasks are mainly categorised as unsupervised or semi-supervised. The former seeks to find and segment the salient targets in a video completely without supervision, with the algorithm itself deciding what the main object to segment is. The latter aims at segmenting an object instance throughout the entire video sequence given only the object mask on the first frame; this can be viewed as a pixel-level object tracking problem. Semi-supervised VOS can be subdivided into single-object segmentation and multi-object segmentation. In the team's Letter, they focus on semi-supervised VOS. Deep learning for VOS has gained attention in the research community in recent years. Existing semi-supervised VOS techniques work by constructing deep networks and fine-tuning a pre-trained classifier on the given ground truth in the first frame during online testing. This online fine-tuning of a classifier during testing has been shown to significantly improve accuracy. The team propose an attention-modulating network for the semi-supervised VOS task.
Co-author Kaihua Zhang elaborates on the process: “We designed an efficient visual and spatial attention modulator based on the semantic information of the annotated object in the first frame and the spatial information of the predicted object mask in the previous frame, respectively, to quickly modulate the segmentation model to focus on the specific object of interest. Then we design a SCAM architecture which includes a channel attention module and a spatial attention module and inject it into the segmentation model to further refine its feature maps. In addition, we construct a feature pyramid attention module to mine context information at different scales to solve the problem of multi-scale segmentation.” Most existing methods rely on fine-tuning models using first-frame annotations and are time-consuming, making them unsuitable for most practical applications. To address this issue, the proposed approach uses an attention-modulating network to focus on the appearance of a specific object instance in a single feed-forward pass without fine-tuning. Compared with other methods, this method achieves state-of-the-art performance on the DAVIS2017 dataset by using attention modulators, feature pyramid attention modules, and focal loss. To overcome a sample imbalance problem, the team adopt focal loss, which distinguishes difficult samples from simple ones and thereby accelerates the convergence of network training. VOS remains challenging due to occlusions, fast motion, deformation, and significant appearance variations over time. The method uses a visual attention modulator to extract semantic information such as category, color, and shape from the first frame. The spatial attention modulator takes the predicted location of the object mask in the previous frame as a spatial prior to guide the segmentation network to focus on the regions where the target is most likely to appear in the current frame.
To handle objects at multiple scales, feature pyramid attention modules mine context information at different scales, achieving better pixel-level attention for the high-level feature maps. The proposed VOS approach is fast, which facilitates many applications, such as interactive video editing and augmented reality. It may be applied to video understanding models in the short term and, after longer-term development, to robotics and self-driving cars. Kaihua Zhang notes on his group's future work: “Experiments show that our algorithm performs erroneous instance segmentation when similar objects occlude each other. To tackle this problem, we will leverage a position-sensitive embedding which is capable of distinguishing the pixels of similar objects. We have also found that solving VOS with multiple instances requires template matching to deal with occlusion and temporal propagation to ensure temporal continuity; otherwise the segmentation instance would be lost. Thus, we will use the re-identification module to retrieve lost instances and take its frame as the starting point and use the mask propagation module to bi-directionally recover the lost instances.” The development of VOS in the next decade is expected to achieve higher precision while meeting real-time application requirements. At present, the cost of manually annotating pixel-level VOS data sets is too high, so cheaper large-scale VOS data sets are expected in the future.
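The focal loss mentioned in the summary above can be illustrated with a minimal sketch. The Letter summary does not give its exact formulation, so this assumes the commonly used binary form, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t); the function name `binary_focal_loss` and the defaults `gamma=2.0` and `alpha=0.25` are illustrative assumptions, not values taken from the Letter.

```python
import numpy as np

def binary_focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Per-pixel binary focal loss (hypothetical helper, assumed formulation).

    probs   : predicted foreground probabilities in [0, 1]
    targets : ground-truth labels, 1 for foreground and 0 for background
    gamma   : focusing parameter; gamma = 0 reduces to weighted cross-entropy
    alpha   : weight for the foreground class (1 - alpha for background)
    """
    probs = np.clip(np.asarray(probs, dtype=np.float64), eps, 1.0 - eps)
    targets = np.asarray(targets, dtype=np.float64)
    # p_t is the probability the model assigns to the true class of each pixel.
    p_t = np.where(targets == 1, probs, 1.0 - probs)
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    # The factor (1 - p_t)^gamma shrinks the loss of well-classified pixels.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

easy = binary_focal_loss([0.95], [1])[0]  # confidently correct pixel
hard = binary_focal_loss([0.30], [1])[0]  # poorly classified pixel
print(hard / easy)  # far larger than the plain cross-entropy ratio
```

Because easy samples contribute very little, the gradient is dominated by hard pixels, which is the sample-imbalance behaviour the summary attributes to focal loss.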
- Research Article
7
- 10.7763/ijcte.2010.v2.248
- Jan 1, 2010
- International Journal of Computer Theory and Engineering
In recent years, video object segmentation has emerged as one of the most important and challenging areas of research. The principal objective of video object segmentation is to facilitate content-based representation by extracting objects of interest from a series of consecutive video frames. A number of video object segmentation algorithms have been proposed recently, but most existing algorithms are not robust enough to process noisy video sequences. The performance of most segmentation techniques is degraded by noise in the frames, which makes edge preservation a critical issue. This paper presents a novel video object segmentation approach for noisy color video sequences, aimed at effective video retrieval. Initially, the noisy video frames are denoised using a strategy based on an enhanced sparse representation in the transform domain. Afterwards, the background is estimated from the denoised frames using the Expectation Maximization (EM) algorithm. Then, the foreground objects, i.e., the moving video objects, are segmented with the aid of the novel approach presented. The biorthogonal wavelet transform and the L2 norm distance measure are employed in the foreground object segmentation. The experimental results demonstrate the effectiveness of the presented approach in segmenting video objects from noisy color video sequences.
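The role of the L2 norm distance in foreground segmentation can be sketched as follows. This is a deliberately simplified illustration and not the paper's method: it compares raw pixel colors rather than biorthogonal wavelet coefficients, and it takes the background as given instead of estimating it with EM. The function name `foreground_mask` and the `threshold` value are assumptions.

```python
import numpy as np

def foreground_mask(frame, background, threshold=30.0):
    """Mark pixels whose color is far from the background (simplified sketch).

    frame, background : arrays of shape (H, W, 3) holding color values
    threshold         : assumed cut-off on the per-pixel L2 distance
    """
    frame = np.asarray(frame, dtype=np.float64)
    background = np.asarray(background, dtype=np.float64)
    # L2 norm of the color difference at each pixel.
    distance = np.linalg.norm(frame - background, axis=-1)
    return distance > threshold

background = np.zeros((4, 4, 3))
frame = background.copy()
frame[1, 1] = [100, 100, 100]  # a single "moving object" pixel
mask = foreground_mask(frame, background)
print(mask.sum())  # number of pixels flagged as foreground
```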
- Conference Article
17
- 10.1145/3394171.3413942
- Oct 12, 2020
Video Object Segmentation (VOS) is typically formulated in a semi-supervised setting. Given the ground-truth segmentation mask on the first frame, the task of VOS is to track and segment one or more objects of interest in the remaining frames of the video at the pixel level. One of the fundamental challenges in VOS is how to make the best use of temporal information to boost performance. We present an end-to-end network that stores short- and long-term video sequence information preceding the current frame as temporal memories to address temporal modeling in VOS. Our network consists of two temporal sub-networks: a short-term memory sub-network and a long-term memory sub-network. The short-term memory sub-network models the fine-grained spatial-temporal interactions between local regions across neighboring frames via a graph-based learning framework, which preserves the visual consistency of local regions over time. The long-term memory sub-network models the long-range evolution of the object via a Simplified-Gated Recurrent Unit (S-GRU), making the segmentation robust against occlusions and drift errors. In our experiments, we show that the proposed method achieves favorable and competitive performance on three frequently used VOS datasets, DAVIS 2016, DAVIS 2017, and Youtube-VOS, in terms of both speed and accuracy.
- Conference Article
82
- 10.1109/cvpr42600.2020.00890
- Jun 1, 2020
Significant progress has been made in Video Object Segmentation (VOS), the video object tracking task at its finest level. While the VOS task can be naturally decoupled into image semantic segmentation and video object tracking, significantly more research effort has been devoted to segmentation than to tracking. In this paper, we introduce "tracking-by-detection" into VOS, which can coherently integrate segmentation into tracking, by proposing a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance. Notably, our method is entirely online and thus suitable for one-shot learning, and our end-to-end trainable model allows multiple object segmentation in one forward pass. We achieve new state-of-the-art performance on the DAVIS benchmark without complicated bells and whistles in both speed and accuracy, with a speed of 0.14 second per frame and a J&F measure of 75.9%, respectively.
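For readers unfamiliar with the J&F measure quoted above: on the DAVIS benchmark, J is the region similarity (the Jaccard index, or intersection over union, between predicted and ground-truth masks), F is a boundary-based contour accuracy, and J&F is their mean. A minimal sketch of the J component follows; the function name `region_similarity` is hypothetical, and returning 1.0 when both masks are empty is an assumed convention.

```python
import numpy as np

def region_similarity(pred_mask, gt_mask):
    """Jaccard index (intersection over union) of two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    intersection = np.logical_and(pred, gt).sum()
    return float(intersection / union)

pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
print(region_similarity(pred, gt))  # 0.5: one shared pixel, two pixels in the union
```

A per-sequence J score is typically the average of this value over the annotated frames.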
- Research Article
6
- 10.1016/j.image.2020.115858
- Apr 20, 2020
- Signal Processing: Image Communication
Video object tracking and segmentation with box annotation
- Conference Article
1
- 10.1109/icspcc46631.2019.8960816
- Sep 1, 2019
Object segmentation in videos has been extensively investigated in recent years. However, semi-supervised object segmentation in videos is still a challenging research topic because temporal information is hard to model. Most research treats video frames independently and loses the relationship between adjacent frames. To overcome this limitation, Semi-supervised Video Object Segmentation with Recurrent Neural Network (SVOSR) is proposed, which uses a convolutional gated recurrent unit (ConvGRU) to learn the temporal information between adjacent frames. The proposed method consists of three main parts. First, the feature extraction part generates spatial information from adjacent frames. Second, the relation part extracts temporal information from the adjacent spatial information. Third, the decoder part combines the spatiotemporal information and infers the results. We introduce the relation part and design the decoder part to improve segmentation. Experiments on the DAVIS dataset show that our method achieves acceptable accuracy and an order-of-magnitude faster inference time compared with OSVOS and other methods.
- Research Article
53
- 10.1109/tpami.2018.2890659
- Jan 1, 2019
- IEEE Transactions on Pattern Analysis and Machine Intelligence
Conventional video object segmentation (VOS) methods based on deep neural networks are dominated by heavy fine-tuning of a segmentation model on the first frame of a given video, which is time-consuming and inefficient. In this paper, we propose a novel method which rapidly adapts a base segmentation model to new video sequences with only a couple of model-update iterations, without sacrificing performance. This efficiency stems from the meta-learning paradigm, which yields a meta-segmentation model, and from a novel continuous learning approach, which enables online adaptation of the segmentation model. Concretely, we train a meta-learner on multiple VOS tasks such that the meta model can capture their common knowledge and gain the ability to quickly adapt the segmentation model to new video sequences. Furthermore, to deal with the unique challenges that temporal variations in the video pose for VOS, e.g., object motion and appearance changes, we propose a principled online adaptation approach that continuously adapts the segmentation model across video frames by exploiting temporal context effectively, providing robustness to such temporal variations. By integrating the meta-learner with the online adaptation approach, the proposed VOS model achieves competitive performance against the state of the art and, moreover, provides faster per-frame processing speed.
- Research Article
13
- 10.1007/s11263-019-01184-2
- May 27, 2019
- International Journal of Computer Vision
We present a novel form of interactive object segmentation called Click Carving which enables accurate segmentation of objects in images and videos with only a few point clicks. Whereas conventional interactive pipelines take the user’s initialization as a starting point, we show the value in the system taking the lead even in initialization. In particular, for a given image or video frame, the system precomputes a ranked list of thousands of possible segmentation hypotheses (also referred to as object region proposals) using appearance and motion cues. Then, the user looks at the top-ranked proposals and clicks on the object boundary to carve away erroneous ones. This process iterates (typically 2–3 times), and each time the system revises the top-ranked proposal set, until the user is satisfied with the resulting segmentation mask. In the case of images, this mask is taken as the final object segmentation. In the case of videos, however, the object region proposals rely on motion as well, and the resulting segmentation mask in the first frame is further propagated across the video to obtain a complete spatio-temporal object tube. On six challenging image and video datasets, we provide extensive comparisons with both existing work and simpler alternative methods. Overall, the proposed Click Carving approach strikes an excellent balance of accuracy and human effort. It outperforms all similarly fast methods, and is competitive with or better than those requiring 2–12 times the effort.
- Book Chapter
3
- 10.4018/978-1-59904-845-1.ch106
- Jan 1, 2009
Video object segmentation aims to extract different video objects from a video (i.e., a sequence of consecutive images). It has attracted vast interest and substantial research effort over the past decade because it is a prerequisite for visual content retrieval (e.g., MPEG-7 related schemes), object-based compression and coding (e.g., MPEG-4 codecs), object recognition, object tracking, security video surveillance, traffic monitoring for law enforcement, and many other applications. Video object segmentation is a nonstandardized but indispensable component of an MPEG-4/7 scheme if a complete solution is to be developed successfully. In fact, in order to utilize MPEG-4 object-based video coding, video object segmentation must first be carried out to extract the required video object masks. Video object segmentation is even more important in military applications such as real-time remote identification and tracking of missiles, vehicles, and soldiers. Other possible applications include home, office, and warehouse security, where monitoring and recording intruders or foreign objects, alerting the personnel concerned, and/or transmitting the segmented foreground objects via a bandwidth-hungry channel while intruders are present are of particular interest. Thus, a fully automatic video object segmentation tool is very useful and has wide practical applications in everyday life, where it can contribute to improved efficiency and to savings in time, manpower, and cost.
- Conference Article
1
- 10.1109/icosst48232.2019.9043975
- Dec 1, 2019
Object segmentation, detection, and tracking in videos is one of the most important tasks in computer vision. It is necessary in all real-time deployed surveillance systems. Various unsupervised and semi-supervised video object segmentation techniques have been implemented and have shown good results. However, all of these techniques process every frame of a video sequence, which requires a large amount of training data and results in long computation times. In this paper, a semi-supervised technique is proposed which segments an object in a video by processing just a single frame of the sequence. In this framework, a fully convolutional network is used to separate the foreground from the image, create the mask of the object, and then segment the object with the help of this mask. The foreground separation in a frame is done using a pre-trained network, while training and testing of the rest of the network are done on the DAVIS dataset. The results show that the proposed framework takes less computational time and improves the overall accuracy of video object segmentation by 10% compared to previous techniques.
- Conference Article
25
- 10.1109/wacv45572.2020.9093333
- Mar 1, 2020
Many recent methods for semi-supervised Video Object Segmentation (VOS) have achieved good performance by exploiting the annotated first frame via one-shot fine-tuning or mask propagation. However, heavily relying on the first frame may weaken the robustness for VOS, since video objects can show large variations through time. In this work, we propose a Dynamic Identity Propagation Network (DIPNet) that adaptively propagates and accurately segments the video objects over time. To achieve this, DIPNet factors the VOS task at each time step into a dynamic propagation phase and a spatial segmentation phase. The former utilizes a novel identity representation to adaptively propagate objects’ reference information over time, which enhances the robustness to videos’ temporal variations. The segmentation phase uses the propagated information to tackle the object segmentation as an easier static image problem that can be optimized via light-weight fine-tuning on the first frame, thus reducing the computational cost. As a result, by optimizing these two components to complement each other, we can achieve a robust system for VOS. Evaluations on four benchmark datasets show that DIPNet provides state-of-the-art performance with time efficiency.
- Dissertation
- 10.32657/10356/179521
- Jan 1, 2024
In the field of Computer Vision (CV), the pursuit of human-level recognition and reasoning of visual scenes has been a long-standing aspiration. Over the past decade, significant contributions to the progress of CV have been made by deep learning, facilitated by the availability of big data and increased computational power. In CV, visual understanding encompasses not only the fundamental aspect of recognition, which involves identifying and categorizing objects and patterns within visual data, but also reasoning, which involves higher-level cognitive processes such as inferring relationships, predicting outcomes, and drawing meaningful conclusions based on the observed visual information. Human-like recognition capabilities across a variety of visual recognition tasks such as image classification, object detection, semantic segmentation, and instance segmentation, have been achieved by machines. The huge success of deep learning for visual recognition tasks has prompted researchers to tackle visual reasoning tasks, which are more challenging. The recognition and reasoning of objects within a scene is of paramount importance to applications such as human-robot collaborations and autonomous vehicles. This thesis aims at addressing the issues associated with current deep learning models for visual understanding and will focus on three tasks: (1) video object segmentation, (2) abductive action inference, and (3) action-conditioned scene graph prediction. Firstly, video object segmentation is an object tracking task that traditionally relies on supervised learning and necessitates extensive annotated datasets for deep learning model training. It is confronted with issues of repetitiveness and impracticality due to the manual labeling of these vast datasets. In addition, tracking objects within videos presents a unique challenge caused by appearance changes and occlusion across video frames. 
The work proposed in this thesis explores the potential of self-supervised learning for video object segmentation, aiming to harness freely available internet data such as YouTube videos for model training. Different from existing self-supervised learning approaches for video object segmentation that model pixel-to-pixel correspondence, this research shifts the focus towards modeling superpixel-to-superpixel correspondence. To achieve this, a novel approach which involves the tracking of superpixels between video frames through an attention mechanism trained within an end-to-end self-supervised framework is proposed. This approach aims to enhance the performance of existing self-supervised video object segmentation models by leveraging deep learning techniques and mitigating the limitations associated with supervised learning using large-scale annotated datasets. It also aims to enhance existing self-supervised models with the benefits of superpixels. Secondly, this thesis introduces the complex task of abductive action inference to assess the abductive reasoning abilities of deep learning models in comprehending visual scenes. A set of innovative object-relational models designed specifically for this task is presented and compared with existing state-of-the-art image, video, and vision-language models. These models are tasked with abducting the human-performed action that led to the scene depicted in an image or snapshot. The experiments demonstrate the potential of deep learning models in performing abductive action inference. This research is vital in advancing our understanding of the reasoning capacities of deep learning models, representing a significant stride toward attaining human-level reasoning capabilities in visual comprehension. Lastly, in action-conditioned scene graph prediction, the task involves predicting scene graph relations for a future state based on an initial state and an action. 
Instead of predicting visual representations for future states, each state is organized into a scene graph comprising human-object relationship triplets, thereby succinctly encapsulating the dynamics within the scene. Notably, this task remains underdeveloped in visual understanding research, with prior video scene graph generation approaches largely overlooking actions as a crucial signal. Actions applied to a preconditioned state generate an effect state embodying their anticipated outcomes. Human cognitive reasoning allows the inference of action consequences based on the initial scene context. To address this challenge, this thesis introduces the Action-conditioned Scene Graph dataset. Furthermore, this thesis proposes the AERT (Action-conditioned Effect Relational Transformer) model, designed to capture scene relations and action context to effectively predict future scene graph relations.
- Research Article
2
- 10.1007/s11042-019-7569-5
- Apr 27, 2019
- Multimedia Tools and Applications
Video object segmentation is an important field in computer vision. However, challenges such as background clutter, occlusion, and edge ambiguity cannot be avoided. In addition, existing labeled video object segmentation datasets are limited in size, which prevents CNN models from reaching their full generalization capabilities. In this paper, we propose a novel approach, called random grid-hiding (RGH), to perform data augmentation. We divide the training image into several rectangular regions and hide some regions randomly during model training. As a result, the convolutional neural network automatically focuses on the discriminative parts of the image, and when the most discriminative part of the image is hidden, the network is compelled to focus on the other related parts of the image. Further, occlusion images are randomly generated at various levels. Random grid-hiding exposes the network to more features, which can effectively reduce the risk of overfitting. Our approach is an effective extension of existing data augmentation (such as random cropping and random flipping), and leads to improved accuracy in video object segmentation on the DAVIS dataset. Our experimental results show that the proposed method is a stable and effective method for data augmentation.
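The grid-hiding step described above can be sketched in a few lines. This is a minimal illustration under assumed settings, not the authors' implementation: the function name `random_grid_hide`, the 4x4 grid, the hiding probability, and the fill value are all hypothetical choices that the abstract does not specify.

```python
import numpy as np

def random_grid_hide(image, grid=(4, 4), hide_prob=0.25, fill=0, rng=None):
    """Hide randomly chosen rectangular grid cells of a training image.

    image     : array of shape (H, W) or (H, W, C)
    grid      : assumed number of (rows, cols) the image is divided into
    hide_prob : assumed probability that each cell is hidden
    fill      : assumed value written into hidden cells
    """
    rng = np.random.default_rng() if rng is None else rng
    out = np.array(image, copy=True)
    height, width = out.shape[:2]
    rows, cols = grid
    # Cell boundaries; the last cell absorbs any remainder pixels.
    row_edges = np.linspace(0, height, rows + 1).astype(int)
    col_edges = np.linspace(0, width, cols + 1).astype(int)
    # Decide independently for every cell whether it is hidden.
    hidden = rng.random((rows, cols)) < hide_prob
    for r in range(rows):
        for c in range(cols):
            if hidden[r, c]:
                out[row_edges[r]:row_edges[r + 1],
                    col_edges[c]:col_edges[c + 1]] = fill
    return out

image = np.ones((8, 8, 3))
augmented = random_grid_hide(image, rng=np.random.default_rng(0))
print(augmented.shape)  # same shape as the input image
```

Applied only during training, with a fresh random draw per sample, this forces the network to rely on different image regions from one iteration to the next.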
- Research Article
33
- 10.1109/tcsvt.2013.2242595
- Jun 1, 2013
- IEEE Transactions on Circuits and Systems for Video Technology
Video object segmentation and tracking are two essential building blocks of smart surveillance systems. However, there are several issues that need to be resolved. Threshold decision is a difficult problem for video object segmentation with a multi-background model. In addition, some conditions make robust video object tracking difficult. These conditions include nonrigid object motion, target appearance variations due to changes in illumination, and background clutter. In this paper, a video object segmentation and tracking framework is proposed for smart cameras in visual surveillance networks, with two major contributions. First, we propose a robust threshold decision algorithm for video object segmentation with a multi-background model. Second, we propose a video object tracking framework based on a particle filter whose likelihood function combines diffusion distance, for measuring color histogram similarity, with a motion cue from video object segmentation. The proposed framework can track nonrigid moving objects under drastic changes in illumination and background clutter. Experimental results show that the presented algorithms perform well on several challenging sequences and that the proposed methods are effective for the aforementioned issues.