VidSeg-GAN
Video object segmentation aims to segment objects in a video sequence, given a user annotation that indicates the object of interest. Although Convolutional Neural Networks (CNNs) have recently been used for foreground segmentation in videos, adversarial training methods have not been used effectively for this problem, despite their extensive use for many other problems in Computer Vision. Earlier work relied heavily on flow features and motion trajectories to capture the temporal consistency between subsequent frames when segmenting moving objects in videos. However, we show that our proposed framework, which processes the video frames independently using a deep generative adversarial network (GAN), maintains temporal coherency across frames without any explicit trajectory-based information and provides superior results. Our main contribution is a GAN-based framework, together with a novel cost function based on the Intersection-over-Union score for training the model, to solve the problem of foreground object segmentation in videos. The proposed method, when evaluated on popular real-world video segmentation datasets, viz. DAVIS, SegTrack-v2 and YouTube-Objects, exhibits a substantial performance gain over recent state-of-the-art methods.
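The abstract does not give the exact form of the Intersection-over-Union cost. As a rough illustration only, the following sketch shows one common differentiable ("soft") IoU loss and how it might be added to an adversarial term. The function names, the `iou_weight` parameter, and the additive combination are assumptions made for this example, not the authors' formulation.

```python
def soft_iou_loss(pred, target, eps=1e-7):
    """Return 1 - soft IoU between predicted probabilities and a binary mask.

    pred and target are flat sequences of equal length; pred values lie in
    [0, 1] and target values are 0 or 1. eps avoids division by zero when
    both masks are empty.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    # Soft intersection: sum of element-wise products.
    intersection = sum(p * t for p, t in zip(pred, target))
    # Soft union: |A| + |B| - |A and B|, computed element-wise.
    union = sum(p + t - p * t for p, t in zip(pred, target))
    return 1.0 - (intersection + eps) / (union + eps)


def generator_loss(adv_loss, pred, target, iou_weight=1.0):
    """Hypothetical combination: adversarial term plus a weighted IoU term."""
    return adv_loss + iou_weight * soft_iou_loss(pred, target)
```

A perfect prediction gives a loss near 0 and a fully disjoint prediction gives a loss near 1, so minimising the term pushes the generator toward masks that overlap the ground truth.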
- Conference Article
1
- 10.1109/icosst48232.2019.9043975
- Dec 1, 2019
Object segmentation, detection and tracking in videos is one of the most important tasks in computer vision. It is necessary in all real-time deployed surveillance systems. Various unsupervised and semi-supervised video object segmentation techniques have been implemented and have produced efficient results. However, all of these techniques process every frame of a video sequence, which requires a large amount of training data and results in long computation times. In this paper, a semi-supervised technique is proposed that segments an object in a video by processing just a single frame of the sequence. In this framework, a fully convolutional network separates the foreground from the image, creates the mask of the object, and then segments the object with the help of this mask. The foreground separation in a frame is done using a pre-trained network, while training and testing of the rest of the network is done on the DAVIS dataset. The results show that the proposed framework takes less computational time and improves the overall accuracy of video object segmentation by 10% compared to previous techniques.
- Conference Article
43
- 10.1109/iccv.2017.544
- Oct 1, 2017
We address an essential problem in computer vision, that of unsupervised foreground object segmentation in video, where a main object of interest in a video sequence should be automatically separated from its background. An efficient solution to this task would enable large-scale video interpretation at a high semantic level in the absence of the costly manual labeling. We propose an efficient unsupervised method for generating foreground object soft masks based on automatic selection and learning from highly probable positive features. We show that such features can be selected efficiently by taking into consideration the spatio-temporal appearance and motion consistency of the object in the video sequence. We also emphasize the role of the contrasting properties between the foreground object and its background. Our model is created over several stages: we start from pixel level analysis and move to descriptors that consider information over groups of pixels combined with efficient motion analysis. We also prove theoretical properties of our unsupervised learning method, which under some mild constraints is guaranteed to learn the correct classifier even in the unsupervised case. We achieve competitive and even state of the art results on the challenging Youtube-Objects and SegTrack datasets, while being at least one order of magnitude faster than the competition. We believe that the strong performance of our method, along with its theoretical properties, constitute a solid step towards solving unsupervised discovery in video.
- Research Article
6
- 10.1016/j.image.2020.115858
- Apr 20, 2020
- Signal Processing: Image Communication
Video object tracking and segmentation with box annotation
- Conference Article
34
- 10.1109/wacv56688.2023.00172
- Jan 1, 2023
Multiple existing benchmarks involve tracking and segmenting objects in video, e.g., Video Object Segmentation (VOS) and Multi-Object Tracking and Segmentation (MOTS), but there is little interaction between them due to the use of disparate benchmark datasets and metrics (e.g. $\mathcal{J}\&\mathcal{F}$, mAP, sMOTSA). As a result, published works usually target a particular benchmark, and are not easily comparable to one another. We believe that the development of generalized methods that can tackle multiple tasks requires greater cohesion among these research sub-communities. In this paper, we aim to facilitate this by proposing BURST, a dataset which contains thousands of diverse videos with high-quality object masks, and an associated benchmark with six tasks involving object tracking and segmentation in video. All tasks are evaluated using the same data and comparable metrics, which enables researchers to consider them in unison, and hence, more effectively pool knowledge from different methods across different tasks. Additionally, we demonstrate several baselines for all tasks and show that approaches for one task can be applied to another with a quantifiable and explainable performance difference. Dataset annotations are available at: https://github.com/Ali2500/BURST-benchmark.
- Conference Article
5
- 10.1145/1186415.1186483
- Jan 1, 2004
Object segmentation in image sequences is one of the fundamental problems in computer vision and graphics. This problem is usually addressed either by discrete representations which are currently manifested by graph partitioning techniques, or by continuous methods typically referred to as active contours. In this work we take a unified approach by fitting splines to graph cuts. The strengths of this approach stem from the dual discrete and continuous representations and from allowing the user to refine the result of the cut by fitting a new spline to it and modifying its points. Segmentation of an object in video is performed by a series of updates to the control points and computation of a minimum graph cut. Usually the graph cut results in a discrete representation over which the user has no control, and which is not always our desired result. Therefore our approach is to fit a spline to the resulting cut in key-frames. This allows the user to change the control points of the spline and then perform additional iterations of cut computation.
- Research Article
76
- 10.1155/2021/5541134
- Jan 1, 2021
- Complexity
Computational visual perception, also known as computer vision, is a field of artificial intelligence that enables computers to process digital images and videos in a similar way as biological vision does. It involves methods to be developed to replicate the capabilities of biological vision. The computer vision’s goal is to surpass the capabilities of biological vision in extracting useful information from visual data. The massive data generated today is one of the driving factors for the tremendous growth of computer vision. This survey incorporates an overview of existing applications of deep learning in computational visual perception. The survey explores various deep learning techniques adapted to solve computer vision problems using deep convolutional neural networks and deep generative adversarial networks. The pitfalls of deep learning and their solutions are briefly discussed. The solutions discussed were dropout and augmentation. The results show that there is a significant improvement in the accuracy using dropout and data augmentation. Deep convolutional neural networks’ applications, namely, image classification, localization and detection, document analysis, and speech recognition, are discussed in detail. In‐depth analysis of deep generative adversarial network applications, namely, image‐to‐image translation, image denoising, face aging, and facial attribute editing, is done. The deep generative adversarial network is unsupervised learning, but adding a certain number of labels in practical applications can improve its generating ability. However, it is challenging to acquire many data labels, but a small number of data labels can be acquired. Therefore, combining semisupervised learning and generative adversarial networks is one of the future directions. 
This article surveys the recent developments in this direction and provides a critical review of the related significant aspects, investigates the current opportunities and future challenges in all the emerging domains, and discusses the current opportunities in many emerging fields such as handwriting recognition, semantic mapping, webcam‐based eye trackers, lumen center detection, query‐by‐string word, intermittently closed and open lakes and lagoons, and landslides.
- Conference Article
713
- 10.1109/cvpr.2017.372
- Jul 1, 2017
Inspired by recent advances of deep learning in instance segmentation and object tracking, we introduce the concept of convnet-based guidance applied to video object segmentation. Our model proceeds on a per-frame basis, guided by the output of the previous frame towards the object of interest in the next frame. We demonstrate that highly accurate object segmentation in videos can be enabled by using a convolutional neural network (convnet) trained with static images only. The key component of our approach is a combination of offline and online learning strategies, where the former produces a refined mask from the previous frame estimate and the latter allows to capture the appearance of the specific object instance. Our method can handle different types of input annotations such as bounding boxes and segments while leveraging an arbitrary amount of annotated frames. Therefore our system is suitable for diverse applications with different requirements in terms of accuracy and efficiency. In our extensive evaluation, we obtain competitive results on three different datasets, independently from the type of input annotation.
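The per-frame guidance described above can be pictured as a simple loop in which each frame's mask is estimated from the frame itself and the previous mask. The sketch below is illustrative only: `propagate_masks` and `toy_refine` are hypothetical names, frames are one-dimensional lists for brevity, and `toy_refine` is a hand-written stand-in for the trained convnet, not the model from the paper.

```python
def propagate_masks(frames, first_mask, refine):
    """Propagate a mask through a sequence, one frame at a time.

    frames: list of frames; first_mask: annotation for frames[0];
    refine: callable (frame, previous_mask) -> mask for that frame.
    """
    masks = [first_mask]
    for frame in frames[1:]:
        # The previous estimate guides the prediction for the current frame.
        masks.append(refine(frame, masks[-1]))
    return masks


def toy_refine(frame, prev_mask, threshold=0.5):
    """Stand-in for a learned model: keep bright pixels near the old mask."""
    n = len(frame)
    new_mask = []
    for i in range(n):
        # A pixel is "near" if it or a direct neighbour was masked before.
        near_prev = any(prev_mask[j] for j in range(max(0, i - 1), min(n, i + 2)))
        new_mask.append(1 if frame[i] > threshold and near_prev else 0)
    return new_mask
```

In this toy setting a bright pixel that shifts by one position per frame is followed across the sequence, which mirrors the idea of the previous mask steering attention toward the object in the next frame.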
- Research Article
1
- 10.34028/iajit/22/1/3
- Jan 1, 2025
- The International Arab Journal of Information Technology
In practical Video Object Segmentation (VOS), traditional techniques adapt poorly and produce insufficient segmentation results. To address these problems, an Unsupervised Video Object Segmentation (UVOS) technique based on convolutional networks is proposed. First, expression decomposition is used to handle the spatiotemporal relationship between the reference frame and the target frame, and video object reconstruction is achieved through similarity calculation. For target segmentation in motion scenes, a Single Linear Bottleneck Operator (SLBO) is introduced for feature extraction, and pooling compensation is used to reduce the loss of feature information. For general scene segmentation, a spatiotemporal similarity segmentation technique is introduced to achieve target video segmentation in complex scenes. In the foreground segmentation test of sports scenes, the Change Detection Benchmark Dataset 2014 (CDNet.2014SM) was selected to test the model's loss performance in different scenarios. In adverse-weather scenario training, the proposed model tends to converge after 40 iterations with a loss value of 0.276, which is superior to the Foreground image Segmentation (FgSegNet_), Convolutional Networks for Biomedical Image Segmentation (MU Net), and Cascade Convolutional Neural Network (Cascade CNN) models. In the accuracy test, the proposed FS-LBPC model tended to converge after 50 iterations, with a precision P-value of 0.963. It performed the best among the four segmentation models: FgSegNet_, MU Net, Cascade CNN, and a real-time Foreground Segmentation network based on single Linear Bottleneck and Pooling Compensation (FS-LBPC). The Densely Annotated VIdeo Segmentation (DAVIS16) dataset is usually selected for video scene segmentation; on it, the best segmentation performance is obtained in horse racing and animal flight scenes, with segmentation accuracy of 0.976 and 0.965, respectively.
In summary, the VOS technology has excellent application effects in practical scenarios, providing important technical references for the improvement of image and video processing and segmentation technology.
- Research Article
20
- 10.1109/tip.2018.2859622
- Jul 30, 2018
- IEEE Transactions on Image Processing
It is a challenging task to extract the segmentation mask of a target from a single noisy video, which involves object discovery coupled with segmentation. To solve this challenge, we present a method to jointly discover and segment an object from a noisy video, where the target disappears intermittently throughout the video. Previous methods either only fulfill video object discovery, or perform video object segmentation presuming the existence of the object in each frame. We argue that jointly conducting the two tasks in a unified way will be beneficial. In other words, video object discovery and video object segmentation tasks can facilitate each other. To validate this hypothesis, we propose a principled probabilistic model, where two dynamic Markov networks are coupled, one for discovery and the other for segmentation. When conducting the Bayesian inference on this model using belief propagation, the bi-directional message passing reveals a clear collaboration between these two inference tasks. We validated our proposed method on five data sets. The first three video data sets, i.e., the SegTrack data set, the YouTube-objects data set, and the Davis data set, are not noisy: all video frames contain the objects. The two noisy data sets, i.e., the XJTU-Stevens data set and the Noisy-ViDiSeg data set, newly introduced in this paper, both have many frames that do not contain the objects. When compared with the state of the art, it is shown that although our method produces inferior results on video data sets without noisy frames, we are able to obtain better results on video data sets with noisy frames.
- Conference Article
65
- 10.1109/iscas.1997.622202
- Jun 9, 1997
Object segmentation and tracking is a key component for the new generation of digital video representation, transmission and manipulation. Example applications include content-based video databases and video editing. We present a general schema for video object modeling, which incorporates low-level visual features and hierarchical grouping. The schema provides a general framework for video object extraction, indexing, and classification. In addition, we present new video segmentation and tracking algorithms based on salient color and affine motion features. The color feature is used for intra-frame segmentation; affine motion is used for tracking image segments over time. Experimental evaluation results using several test video streams are included.
- Conference Article
3
- 10.1109/ictai52525.2021.00068
- Nov 1, 2021
Training a deep generative adversarial network (GAN) with hundreds or even thousands of layers is difficult. The backpropagation path of the generator is deeper than that of the discriminator, which makes the generator prone to vanishing or exploding gradients. This paper proposes a method to train a deep vanilla GAN based on mean field theory. By adjusting the parameter variances and the activation of the GAN, a 200-layer vanilla GAN can be trained steadily without adding any batch normalization layers or residual blocks. We demonstrate that a deep GAN is very sensitive to the parameter variances $\sigma_w^2$, $\sigma_b^2$ in the initialization scheme, and explain why hard tanh is more suitable than ReLU as an activation in a deep vanilla GAN. Experiments on the MNIST and Fashion-MNIST data sets validate that our method trains a deep vanilla GAN well and can produce high-quality images.
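To make the role of $\sigma_w^2$ and $\sigma_b^2$ concrete, here is a minimal sketch of a fully connected stack initialised in the usual mean-field convention, with weights drawn from $\mathcal{N}(0, \sigma_w^2 / \text{fan-in})$, biases from $\mathcal{N}(0, \sigma_b^2)$, and a hard tanh activation. The helper names and the specific variance values in the usage note are assumptions for illustration; the abstract does not state which values the authors use.

```python
import math
import random


def hard_tanh(x):
    """Clip x to the interval [-1, 1]."""
    return max(-1.0, min(1.0, x))


def init_layer(fan_in, fan_out, sigma_w2, sigma_b2, rng):
    """Draw weights ~ N(0, sigma_w2 / fan_in) and biases ~ N(0, sigma_b2)."""
    w_std = math.sqrt(sigma_w2 / fan_in)
    b_std = math.sqrt(sigma_b2)
    weights = [[rng.gauss(0.0, w_std) for _ in range(fan_in)]
               for _ in range(fan_out)]
    biases = [rng.gauss(0.0, b_std) for _ in range(fan_out)]
    return weights, biases


def forward(x, layers):
    """Pass a vector through every layer, applying hard tanh after each."""
    for weights, biases in layers:
        x = [hard_tanh(sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x
```

For example, `layers = [init_layer(8, 8, 1.0, 0.05, random.Random(0)) for _ in range(20)]` builds a 20-layer stack with placeholder variances; varying `sigma_w2` and `sigma_b2` and watching the spread of `forward(x, layers)` gives a feel for how sensitive signal propagation is to the initialisation.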
- Book Chapter
2
- 10.1007/978-3-319-70742-6_12
- Jan 1, 2017
This paper investigates how to exploit human feedback for interactive object segmentation in videos. In particular, we present an interactive video object segmentation approach where humans can contribute by either explicitly clicking on objects of interest in videos or implicitly while looking at video sequences. User feedback is then translated into a set of spatio-temporal constraints for an energy-based minimization problem. We tested the method on standard benchmarking datasets when using both eye-gaze data and user clicks. The results indicated how our method outperformed existing automated and interactive methods regardless of the type of human feedback (explicit or implicit), and that click-based feedback was more reliable than eye-gaze one.
- Research Article
19
- 10.5075/epfl-thesis-3411
- Jan 1, 2005
- Infoscience (Ecole Polytechnique Fédérale de Lausanne)
Quality assessment is a central issue in the design, implementation, and performance testing of all systems. Digital signal processing systems generally deal with visual information that are meant for human consumption. An image, a video, or a 3D model may go through different stages of processing before being presented to a human observer, and each stage of processing may introduce distortions that could reduce the quality of the final display. To conceive quantitative metrics that can automatically predict the perceived quality, the way humans perceive such distortions has to be taken into account and can be greatly beneficial for quality assessment. In general, an objective quality metric plays an important role in a broad range of applications, such as visual information acquisition, compression, analysis and watermarking. Quality metrics can be used to optimize algorithm parameter settings and to benchmark different processing systems and algorithms. In this dissertation, new objective quality metrics that take into account how distortions are perceived, are proposed and three different signal processing systems are considered: video watermarking, video object segmentation and 3D models watermarking. First, two new objective metrics for watermarked video quality assessment are proposed. Based on several different watermarking algorithms and video sequences, the most predominant distortions are identified as spatial noise and temporal flicker. Corresponding metrics are designed and their performance is tested through subjective experiments. Second, the problem of video object segmentation quality evaluation is discussed, proposing both subjective evaluation methodology and perceptual objective quality metric. Since a perceptual metric requires a good knowledge of the kinds of artifacts present in segmented video objects, the most typical artifacts are synthetically generated. 
Psychophysical experiments are carried out to study the perception of individual artifacts by themselves or combined. A new metric is proposed by combining the individual artifacts using the Minkowski metric and a linear model. An in-depth evaluation of the performance of the proposed method is carried out. The obtained perceptual metric is also used to benchmark different video object segmentation techniques for general frameworks as well as specific applications, ranging from object-based coding to video surveillance. Third, two novel metrics for watermarked 3D model quality assessment are proposed on the basis of two subjective experiments. The first psychophysical experiment is carried out to investigate the perception of distortions caused by watermarking 3D models. Two roughness estimation metrics have been devised to perceptually measure the amount of visual distortion introduced on the model's surface. The second psychophysical experiment is conducted in order to validate the two proposed metrics with other watermarking algorithms. All of the proposed metrics for the three kinds of visual information processing systems are based on the results of the psychophysical experiments. Subjective tests are carried out to study and characterize the impact of distortions on human perception. An evaluation of the performance of these perceptual metrics with respect to the most common state-of-the-art objective metrics is performed. The comparison shows a better performance of the proposed perceptual metrics than that of the state-of-the-art metrics. The performance is investigated in terms of correlation with subjective opinion. The results demonstrate that including the perception of distortions in objective metrics is a reliable approach and improves the performance of such metrics.
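The abstract mentions combining individual artifact measures with the Minkowski metric. As a hedged illustration of that kind of pooling, the sketch below computes $\left(\sum_i w_i\,|a_i|^p\right)^{1/p}$ over a list of artifact scores. The function name, the weights, and the exponent are placeholders chosen for this example; the thesis's actual artifacts, fitted weights, and exponent are not given in the abstract.

```python
def minkowski_pool(artifact_scores, weights=None, p=2.0):
    """Combine per-artifact scores as (sum_i w_i * |a_i|**p) ** (1 / p).

    artifact_scores: one non-negative annoyance score per artifact type.
    weights: optional per-artifact weights (defaults to 1 for each).
    p: Minkowski exponent; larger p lets the worst artifact dominate.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    if weights is None:
        weights = [1.0] * len(artifact_scores)
    if len(weights) != len(artifact_scores):
        raise ValueError("weights and artifact_scores must match in length")
    total = sum(w * abs(a) ** p for w, a in zip(weights, artifact_scores))
    return total ** (1.0 / p)
```

With `p = 1` the pooling is a plain weighted sum, while a large `p` approaches the maximum score, which is one way to model a single severe artifact dominating perceived quality.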
- Conference Article
- 10.1117/12.838820
- Jan 17, 2010
- Proceedings of SPIE, the International Society for Optical Engineering/Proceedings of SPIE
One of the most important problems in Computer Vision is the computation of the 2D projective transformation (homography) that maps features of planar objects in different images and videos. This computation is required by many applications such as image mosaicking, image registration, and augmented reality. The real-time performance imposes constraints on the methods used. In this paper, we address the real-time detection and tracking of planar objects in a video sequence where the object of interest is given by a reference image template. Most existing approaches for homography estimation are based on two steps: feature extraction (first step) followed by a combinatorial optimization method (second step) to match features between the reference template and the scene frame. This paper has two main contributions. First, for the detection part, we propose a feature point classification which is applied prior to performing the matching step in the process of homography calculation. Second, for the tracking part, we propose a fast method for the computation of the homography that is based on the transferred object features and their associated local raw brightness. The advantage of this proposed scheme is a fast and accurate estimation of the homography.
- Research Article
8
- 10.1364/josaa.29.000928
- May 21, 2012
- Journal of the Optical Society of America A
One of the most important problems in computer vision is the computation of the two-dimensional projective transformation (homography) that maps features of planar objects in different images and videos. This computation is required by many applications such as image mosaicking, image registration, and augmented reality. The real-time performance imposes constraints on the methods used. In this paper, we address the real-time detection and tracking of planar objects in a video sequence where the object of interest is given by a reference image template. Most existing approaches for homography estimation are based on two steps: feature extraction (first step) followed by a combinatorial optimization method (second step) to match features between the reference template and the scene frame. This paper has two main contributions. First, we detect both planar and nonplanar objects via efficient object feature classification in the input images, which is applied prior to performing the matching step. Second, for the tracking part (planar objects), we propose a fast method for the computation of the homography that is based on the transferred object features and their associated local raw brightness. The advantage of the proposed schemes is a fast matching as well as fast and robust object registration that is given by either a homography or three-dimensional pose.
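For readers unfamiliar with the term, a homography is a 3x3 matrix that maps points between two views of a plane in homogeneous coordinates. The sketch below only shows how a given homography is applied to 2D points; it does not reproduce the paper's detection or tracking method, and `apply_homography` is a name chosen for this example.

```python
def apply_homography(H, points):
    """Map 2D points through a 3x3 homography H (nested lists, row-major).

    Each point (x, y) is lifted to (x, y, 1), multiplied by H, and divided
    by the resulting third coordinate to return to the image plane.
    """
    mapped = []
    for x, y in points:
        u = H[0][0] * x + H[0][1] * y + H[0][2]
        v = H[1][0] * x + H[1][1] * y + H[1][2]
        w = H[2][0] * x + H[2][1] * y + H[2][2]
        if abs(w) < 1e-12:
            raise ValueError("point maps to infinity under this homography")
        mapped.append((u / w, v / w))
    return mapped
```

Because of the final division, multiplying every entry of `H` by the same non-zero constant leaves the mapping unchanged, which is why a homography has eight degrees of freedom rather than nine.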