Segmentation and tracking of video objects: suited to content-based video indexing, interactive television and production systems
This paper examines the problem of segmentation and tracking of video objects in a content-based information retrieval context. Our method starts with an interactive video object selection, then alternately tracks and fits the object of interest for as long as possible. A user-based selection is required to initialize the process, whereas an active contour model progressively refines the selection by fitting the natural edges of the object. The video object is then tracked using a hybrid structure that combines a hierarchical mesh, for motion estimation between two frames, with a multi-resolution active contour model. This contour model is derived directly from the mesh boundaries in order to reposition the snake's nodes onto the natural edges of the object.
- Conference Article
7
- 10.1109/icme.2000.871574
- Apr 28, 2017
This paper examines the problem of segmentation and tracking of video objects for content-based information retrieval. Segmentation and tracking of video objects play an important role in the index creation and user request definition steps. The object is initially selected using a semi-automatic approach. For this purpose, a user-based selection is required to roughly define the object to be tracked. In this paper, we propose two different methods to obtain an accurate contour definition from the user selection. The first is based on an active contour model which progressively refines the selection by fitting the natural edges of the object, while the second uses a binary partition tree with a marker and propagation approach. The video object is then tracked using a hybrid structure that alternately combines a hierarchical mesh, for motion estimation between two frames, with a multi-resolution active contour model. This contour model is derived directly from the mesh boundaries in order to reposition the snake's nodes onto the natural edges of the object. The object-based segmentation associated with object tracking allows relevant descriptors to be built for future matching purposes.
- Research Article
33
- 10.1109/tcsvt.2013.2242595
- Jun 1, 2013
- IEEE Transactions on Circuits and Systems for Video Technology
Video object segmentation and tracking are two essential building blocks of smart surveillance systems. However, there are several issues that need to be resolved. Threshold decision is a difficult problem for video object segmentation with a multi-background model. In addition, some conditions make robust video object tracking difficult. These conditions include nonrigid object motion, target appearance variations due to changes in illumination, and background clutter. In this paper, a video object segmentation and tracking framework is proposed for smart cameras in visual surveillance networks with two major contributions. First, we propose a robust threshold decision algorithm for video object segmentation with a multi-background model. Second, we propose a video object tracking framework based on a particle filter with the likelihood function composed of diffusion distance for measuring color histogram similarity and motion clue from video object segmentation. The proposed framework can track nonrigid moving objects under drastic changes in illumination and background clutter. Experimental results show that the presented algorithms perform well for several challenging sequences, and our proposed methods are effective for the aforementioned issues.
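As a rough illustration of the likelihood described above, the following Python sketch weights particles by a colour-histogram distance multiplied by a motion score. It is not the paper's implementation: `diffusion_distance` is a simplified 1-D variant (the sum of L1 norms over a smoothed, downsampled pyramid of the histogram difference), and `motion_scores`, `sigma`, and all function names are assumptions introduced for this example.

```python
import math


def smooth_and_downsample(d):
    """Blur a 1-D signal with a [1, 2, 1] / 4 kernel, then keep every second sample."""
    n = len(d)
    blurred = []
    for i in range(n):
        left = d[i - 1] if i > 0 else d[i]
        right = d[i + 1] if i < n - 1 else d[i]
        blurred.append(0.25 * left + 0.5 * d[i] + 0.25 * right)
    return blurred[::2]


def diffusion_distance(h1, h2):
    """Sum the L1 norm of the histogram difference over successively coarser layers."""
    d = [a - b for a, b in zip(h1, h2)]
    total = sum(abs(x) for x in d)
    while len(d) > 1:
        d = smooth_and_downsample(d)
        total += sum(abs(x) for x in d)
    return total


def particle_weights(reference_hist, particle_hists, motion_scores, sigma=0.2):
    """Weight each particle by colour similarity times a motion score, then normalise.

    motion_scores is a hypothetical per-particle value in [0, 1], for example the
    fraction of segmented foreground pixels inside the particle's window.
    """
    raw = []
    for hist, motion in zip(particle_hists, motion_scores):
        dist = diffusion_distance(reference_hist, hist)
        raw.append(math.exp(-dist * dist / (2.0 * sigma * sigma)) * motion)
    total = sum(raw)
    if total == 0.0:
        # Degenerate case: fall back to uniform weights.
        return [1.0 / len(raw)] * len(raw)
    return [w / total for w in raw]
```

In a full tracker these weights would drive the resampling step; a particle whose window matches the reference colours and overlaps the segmented foreground receives most of the probability mass.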
- Research Article
40
- 10.1109/tcsvt.2004.828347
- Jun 1, 2004
- IEEE Transactions on Circuits and Systems for Video Technology
Segmenting and tracking objects in video is of great importance for video-based encoding, surveillance, and retrieval. However, the inherent difficulty of object segmentation and tracking is to distinguish changes in the displacement of objects from disturbing effects such as noise and illumination changes. Therefore, in this paper, we formulate a color-based deformable model which is robust against noisy data and changing illumination. Computational methods are presented to measure color constant gradients. Further, a model is given to estimate the amount of sensor noise through these color constant gradients. The obtained uncertainty is subsequently used as a weighting term in the deformation process. Experiments are conducted on image sequences recorded from three-dimensional scenes. The experimental results show that the proposed color constant deformable method successfully finds object contours and is robust against illumination changes and noisy but homogeneous regions.
- Conference Article
6
- 10.1109/ism.2010.20
- Dec 1, 2010
This paper presents a video moving object segmentation and tracking system based on active contour and color classification models. First, the active contour model is applied to segment the target object in the initial frame. From the segmented object, the object and background regions are extracted. Then the object and background regions are clustered separately according to their color features using the K-means algorithm. Subsequently, the video object in the next frame is automatically tracked using temporal differencing and block matching. The moving and stationary regions in a frame are estimated by temporal differencing. In the moving regions, pixels obtain their classification from the previous frame through block matching, while in the stationary regions they inherit their classification directly from the previous frame. Experimental results show that the proposed method provides better performance than the active contour method applied to video object tracking.
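The propagation step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: frames are grayscale lists of lists, the K-means colour classes are reduced to an integer label map, and the names `moving_mask`, `block_cost`, `propagate_labels` and the parameters `threshold`, `radius`, `search` are hypothetical.

```python
def moving_mask(prev, curr, threshold=20):
    """Temporal differencing: a pixel is 'moving' if its intensity changed by more than threshold."""
    return [[abs(c - p) > threshold for p, c in zip(prow, crow)]
            for prow, crow in zip(prev, curr)]


def block_cost(prev, curr, y, x, dy, dx, radius):
    """Mean absolute difference between the block around (y, x) in curr and the
    block around (y + dy, x + dx) in prev; out-of-frame pixels are skipped."""
    h, w = len(curr), len(curr[0])
    total, count = 0, 0
    for by in range(-radius, radius + 1):
        for bx in range(-radius, radius + 1):
            cy, cx = y + by, x + bx
            py, px = cy + dy, cx + dx
            if 0 <= cy < h and 0 <= cx < w and 0 <= py < h and 0 <= px < w:
                total += abs(curr[cy][cx] - prev[py][px])
                count += 1
    return total / count if count else float("inf")


def propagate_labels(prev, curr, prev_labels, threshold=20, radius=1, search=2):
    """Stationary pixels keep their previous label; moving pixels take the label
    of the best block match found in the previous frame."""
    h, w = len(curr), len(curr[0])
    mask = moving_mask(prev, curr, threshold)
    labels = [row[:] for row in prev_labels]
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            best_cost, best_dy, best_dx = float("inf"), 0, 0
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    if 0 <= y + dy < h and 0 <= x + dx < w:
                        cost = block_cost(prev, curr, y, x, dy, dx, radius)
                        if cost < best_cost:
                            best_cost, best_dy, best_dx = cost, dy, dx
            labels[y][x] = prev_labels[y + best_dy][x + best_dx]
    return labels
```

Restricting the block search to pixels flagged by temporal differencing is what keeps the cost manageable: in a mostly static scene only a small fraction of pixels needs a search at all.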
- Book Chapter
1
- 10.1007/978-981-19-1018-0_57
- Jan 1, 2022
Moving object segmentation and detection have become an important topic in computer vision. As such, they are widely used in video surveillance applications such as driving assistance programs, robots, traffic monitoring, and crime pattern identification. In addition, video object tracking is an important function in video surveillance systems because it provides temporary interactive information about moving objects. An important function of video object segmentation is to find and separate important elements in the video frame behind the domain. The purpose of video tracking is to combine targeted objects into consecutive video frames. First of all, enhanced threshold filtered video object detection and tracking (TFVODT) is designed to classify objects according to their size and color, and to get better accuracy of video object detection. Initially, the TFVODT framework distinguishes a video object by its characteristics such as size and color. The TFVODT framework performs the function of distinguishing an object through the median filter-based enhanced Laplacian thresholding process. Along with the support of the split object, the TFVODT framework does well to track the video object. Second, threshold filtered video object detection and tracking (ITFVODT) is developed to distinguish a video's elements based on their features such as texture, durability, and performance of video object detection. All video frames found in the ITFVODT framework contain similar features, such as quality and contrast. Keywords: Object tracking, ITFVODT, TFVODT, EMFVD, Segmentation
- Research Article
194
- 10.1145/3391743
- May 25, 2020
- ACM Transactions on Intelligent Systems and Technology
Object segmentation and object tracking are fundamental research areas in the computer vision community. Both topics share some common challenges, such as occlusion, deformation, motion blur, scale variation, and more. The former must also handle heterogeneous objects, interacting objects, edge ambiguity, and shape complexity; the latter suffers from difficulties in handling fast motion, out-of-view targets, and real-time processing. Combining the two problems as Video Object Segmentation and Tracking (VOST) can overcome their respective difficulties and improve their performance. VOST can be applied to many practical applications such as video summarization, high definition video compression, human computer interaction, and autonomous vehicles. This survey aims to provide a comprehensive review of the state-of-the-art VOST methods, classify these methods into different categories, and identify new trends. First, we broadly categorize VOST methods into Video Object Segmentation (VOS) and Segmentation-based Object Tracking (SOT). Each category is further classified into various types based on the segmentation and tracking mechanism. Moreover, we present some representative VOS and SOT methods of each time node. Second, we provide a detailed discussion and overview of the technical characteristics of the different methods. Third, we summarize the characteristics of the related video datasets and provide a variety of evaluation metrics. Finally, we point out a set of interesting directions for future work and draw our own conclusions.
- Research Article
8
- 10.1016/j.image.2007.09.001
- Sep 19, 2007
- Signal Processing: Image Communication
Video object segmentation and tracking using region-based statistics
- Research Article
6
- 10.1016/j.image.2020.115858
- Apr 20, 2020
- Signal Processing: Image Communication
Video object tracking and segmentation with box annotation
- Research Article
5
- 10.1117/1.jei.25.6.061612
- Nov 9, 2016
- Journal of Electronic Imaging
This paper presents an algorithm for automatic segmentation of moving objects in video based on spatiotemporal visual saliency and an active contour model. Our algorithm exploits the visual saliency and motion information to build a spatiotemporal visual saliency map used to extract a moving region of interest. This region is used to automatically provide the seeds for the convex active contour (CAC) model to segment the moving object accurately. The experiments show a good performance of our algorithm for moving object segmentation in video without user interaction, especially on the SegTrack dataset.
- Conference Article
34
- 10.1109/wacv56688.2023.00172
- Jan 1, 2023
Multiple existing benchmarks involve tracking and segmenting objects in video, e.g., Video Object Segmentation (VOS) and Multi-Object Tracking and Segmentation (MOTS), but there is little interaction between them due to the use of disparate benchmark datasets and metrics (e.g., $\mathcal{J}\&\mathcal{F}$, mAP, sMOTSA). As a result, published works usually target a particular benchmark, and are not easily comparable to one another. We believe that the development of generalized methods that can tackle multiple tasks requires greater cohesion among these research sub-communities. In this paper, we aim to facilitate this by proposing BURST, a dataset which contains thousands of diverse videos with high-quality object masks, and an associated benchmark with six tasks involving object tracking and segmentation in video. All tasks are evaluated using the same data and comparable metrics, which enables researchers to consider them in unison, and hence, more effectively pool knowledge from different methods across different tasks. Additionally, we demonstrate several baselines for all tasks and show that approaches for one task can be applied to another with a quantifiable and explainable performance difference. Dataset annotations are available at: https://github.com/Ali2500/BURST-benchmark.
- Research Article
82
- 10.1109/tcsvt.2002.808089
- Jan 1, 2003
- IEEE Transactions on Circuits and Systems for Video Technology
Video object segmentation and tracking are essential for content-based video processing. This paper presents a framework for a semiautomatic approach to this problem. A semantic video object is initialized with human assistance in a key frame. The video object is then tracked and segmented automatically in the following frames. A new active contour model, VSnakes, is introduced as a segmentation method in this framework. The active contour energy is defined so as to reflect the energy difference between two contours instead of the energy of a single contour. Multiple-resolution wavelet decomposition is applied in generating the edge energy of the image frame. Contour relaxation is used to deal with the object deformation frame by frame, and the Viterbi algorithm is used to update the contour path during contour relaxation. Compared to the original snakes algorithm, semiautomatic video object segmentation with the VSnakes algorithm resulted in improved performance in terms of video object shape distortion (1.4% versus 2.9% in one experiment), which suggests that it could be a useful tool in many content-based video applications, e.g., MPEG-4 video object generation and medical imaging.
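To make the role of the Viterbi algorithm in contour relaxation more concrete, here is a small Python sketch of dynamic-programming contour optimisation. It is a generic formulation and not the VSnakes energy: the contour is treated as an open chain, the internal term is a plain squared distance between neighbouring nodes, and the names `viterbi_contour`, `candidates`, `energy`, and `alpha` are assumptions made for this example.

```python
def viterbi_contour(candidates, energy, alpha=1.0):
    """Pick one position per contour node so that the sum of external energy
    and squared distance between neighbouring nodes is minimal.

    candidates[i] is a list of (y, x) options for node i;
    energy(y, x) is the external (edge) energy, lower is better.
    """
    n = len(candidates)
    cost = [[energy(y, x) for (y, x) in candidates[0]]]
    back = [[0] * len(candidates[0])]
    for i in range(1, n):
        row_cost, row_back = [], []
        for (y, x) in candidates[i]:
            best_j, best_val = 0, float("inf")
            for j, (py, px) in enumerate(candidates[i - 1]):
                val = cost[i - 1][j] + alpha * ((y - py) ** 2 + (x - px) ** 2)
                if val < best_val:
                    best_j, best_val = j, val
            row_cost.append(best_val + energy(y, x))
            row_back.append(best_j)
        cost.append(row_cost)
        back.append(row_back)
    # Backtrack from the cheapest final state to recover the optimal path.
    k = min(range(len(cost[-1])), key=cost[-1].__getitem__)
    path = []
    for i in range(n - 1, -1, -1):
        path.append(candidates[i][k])
        k = back[i][k]
    path.reverse()
    return path
```

For example, with a vertical edge at `x = 3` and candidates `x in (2, 3, 4)` on each row, calling `viterbi_contour(cands, lambda y, x: abs(x - 3))` places every node on the edge. A closed contour would additionally need the first and last nodes to be tied together, which this sketch omits.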
- Conference Article
9
- 10.1109/icmlc.2008.4620823
- Jul 1, 2008
As a critical step in many multimedia applications, shot boundary detection has attracted much research interest in recent years. Most existing methods measure the similarity among video frames based on their low-level features. However, they are sensitive to changes not only in brightness, color, and object motion, but also in camera motion and video quality. This paper proposes an innovative shot boundary detection method for news video based on video object segmentation and tracking. It combines three main techniques: partitioned histogram comparison, and video object segmentation and tracking based on wavelet analysis. The partitioned histogram comparison is used as the first filter to effectively reduce the number of video frames which need object segmentation and tracking. The unsupervised video object segmentation and tracking based on wavelet analysis is robust to the problems mentioned above. The efficacy of the proposed method is extensively tested on more than 3 hours of CCTV and CNN news programs, achieving 96.4% recall with 97.2% precision.
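The first-stage filter mentioned above can be illustrated with a short Python sketch of partitioned histogram comparison. The abstract does not give the block layout, bin count, or threshold, so `grid`, `bins`, `threshold`, and the function names below are assumptions; frames are grayscale lists of lists with values in 0..255 and at least `grid` pixels per side.

```python
def block_histogram(frame, y0, y1, x0, x1, bins=8, max_value=256):
    """Intensity histogram of the rectangle [y0, y1) x [x0, x1)."""
    hist = [0] * bins
    for y in range(y0, y1):
        for x in range(x0, x1):
            hist[frame[y][x] * bins // max_value] += 1
    return hist


def partitioned_histogram_difference(frame_a, frame_b, grid=2, bins=8):
    """Mean normalised histogram difference over grid x grid blocks, in [0, 1]."""
    h, w = len(frame_a), len(frame_a[0])
    diffs = []
    for gy in range(grid):
        for gx in range(grid):
            y0, y1 = gy * h // grid, (gy + 1) * h // grid
            x0, x1 = gx * w // grid, (gx + 1) * w // grid
            ha = block_histogram(frame_a, y0, y1, x0, x1, bins)
            hb = block_histogram(frame_b, y0, y1, x0, x1, bins)
            area = (y1 - y0) * (x1 - x0)
            diffs.append(sum(abs(a - b) for a, b in zip(ha, hb)) / (2 * area))
    return sum(diffs) / len(diffs)


def candidate_boundaries(frames, threshold=0.5):
    """Indices i where frame i differs strongly from frame i - 1."""
    return [i for i in range(1, len(frames))
            if partitioned_histogram_difference(frames[i - 1], frames[i]) > threshold]
```

Only the frame pairs returned by `candidate_boundaries` would then be passed on to the more expensive object segmentation and tracking stage.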
- Supplementary Content
- 10.6342/ntu.2004.01450
- Jan 1, 2004
- 臺灣大學電子工程學研究所學位論文
Digital video technology has played an essential role in our daily life for entertainment, communication, surveillance, and intelligent human-machine interfaces. In this dissertation, algorithms and architectures of core techniques for both current and future video applications are discussed in three different parts: block matching motion estimation, H.264/AVC encoding systems, and intelligent video signal processing. Motion estimation (ME) is the heart of video coding systems. It is the most important module and demands the most computing power and memory access in a video encoder. In Part I of this dissertation, we first made a comprehensive survey of ME algorithms and architectures during the last two decades (1981-2004). All fast block matching algorithms (BMAs) are classified into six categories, and many of them are compared in terms of video quality and computational complexity, which provides useful guidelines for software applications. Many architectures supporting full search or fast search are introduced, and comparisons of representative designs are presented in six aspects by hexagonal plots for clear evaluation. Second, we proposed a global elimination algorithm (GEA) for fast block matching. The main concept of GEA is to divide the block matching into an initial scan of all search positions with coarse matching of candidates, followed by fine matching of candidates which are the potential ones in the initial scan. While preserving the same quality as full search, GEA has less than 10% of full search complexity. The corresponding GEA architecture comprising a systolic part to extract coarse features, a parallel sum of absolute differences (SAD) tree to perform matching operations, and a parallel comparator tree to find the potential candidates, is also developed. Moreover, we further proposed a parallel global elimination algorithm (PGEA) and its corresponding architecture for higher specifications. 
Our design is 10 times more area-speed efficient than full search architectures. Third, we proposed a computation-aware (CA) BMA to obtain better motion vectors with real-time constraints in a computation-limited and computation-variant environment. Different from prior CA BMAs in which random access of macroblocks is inevitable, our one-pass flow can not only significantly reduce the memory size but also effectively utilize the context information of neighboring macroblocks to achieve faster speed and better quality. Moreover, video quality can be further improved with the adaptive search strategy. Our one-pass algorithm can save 70% of the processing time while obtaining the same quality in comparison with prior CA BMAs. H.264/AVC is the latest international video coding standard. It can save 39%, 49%, and 64% of bitrates in comparison with MPEG-4, H.263, and MPEG-2, respectively. In Part II of this dissertation, we first proposed a context-based adaptive method to speed up the multi-frame ME, which is the most computationally intensive part in an H.264/AVC encoder. Statistical analysis is applied to the available information after intra prediction and the block matching process for the previous reference frame. Context-based adaptive criteria are then derived to determine whether it is worth searching more reference frames. Full search quality can be maintained while 76%-96% of unnecessary reference frames can be omitted. Second, we proposed an H.264/AVC intra frame coding fast algorithm and an H.264/AVC intra coder architecture. Context-based decimation of unlikely candidates, subsampling of matching operations, and interleaved full-search/partial-search strategy are adopted in the software implementation, which can reduce 45% of the total computation while keeping the PSNR degradation less than 0.3dB. As for the hardware accelerator, a four-parallel system architecture is designed with comprehensive analysis. 
A prototype chip with core size of 1.855x1.885 mm², which can process 16 megapixels within one second at 54 MHz, is fabricated using 0.25 μm CMOS technology. Third, we proposed the first H.264/AVC single-chip encoder in the world. The core size is 7.68x4.13 mm² with 0.18 μm CMOS technology. A new four-stage macroblock pipelining architecture encodes HDTV720p (1280x720) 30 frames/s videos in real time at 108 MHz. The new pipelining doubles the throughput and utilization of the conventional two-stage macroblock pipelining. The encoder contains five engines for integer motion estimation (IME), fractional motion estimation (FME), intra prediction (IP), entropy coding (EC), and deblocking (DB). We contributed many novel ideas to overcome the tough design challenges (3.6 TOPS of computing power and 5.6 TB/s of memory access on a processor). Intelligent video signal processing is the driving force of advanced video applications, and video object segmentation is the most important pre-processing unit for object-based MPEG-4, object tracking, face recognition, sprite generation, MPEG-7 multimedia description, etc. In Part III of this dissertation, we first reviewed an efficient algorithm for video object segmentation. Background registration is the main idea, which can easily solve the still object problem and the uncovered background problem encountered by conventional change detection. With optimized implementation, a 450 MHz Pentium III CPU can process 25 QCIF (176x144) frames in one second. Moreover, the elimination of shadow effects, combination with predictive watersheds for more accurate object boundaries, and global motion compensation for slight camera motion are also considered as enhancements of the baseline mode. Second, we proposed a simple but effective algorithm for a pan-tilt camera to automatically track one moving object. 
The proposed tracking algorithm collects the background information at the grid points of camera positions and then compares the captured frame with the background at a grid point to determine the next grid point. A moving object is thus kept in the middle of the image. Block-based processing and skin color detection are used to reduce computation and to favor human faces, respectively. Many practical situations are tested, and our tracking algorithm has been successfully integrated into a commercial surveillance IP camera. Third, we proposed a low-complexity descriptor-based face recognition method. Descriptors with translation-, rotation-, and scaling-invariant properties are used as the input vectors to the feature extraction kernel instead of raster-scanned image pixels, making our method much more reliable than conventional pixel-based algorithms. What is more, the computational complexity and the memory requirement are reduced by a factor of millions due to the dimension reduction of the input vectors and the covariance matrix. The processing time to calculate the projection directions is reduced from several tens of hours to a few seconds. In brief, digital video techniques are contributed in three directions. The proposed motion estimation can be applied in all video coding standards. The proposed H.264/AVC encoding system is the leading design in the world and brings many new concepts. The proposed video segmentation, object tracking, and face recognition will play key roles in structured videos and intelligent surveillance systems. We sincerely hope that our research results can make progress for the convenience of human life.
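The two-stage idea behind the fast block matching described in Part I of the abstract above (a coarse scan of all search positions followed by fine matching of the most promising candidates) can be sketched in Python as follows. This is only an illustration of the coarse-then-fine principle, not the dissertation's GEA: the coarse feature used here is simply the block sum, keeping a fixed number of candidates is a heuristic that does not guarantee full-search quality, and `coarse_to_fine_search`, `keep`, `size`, and `search` are hypothetical names.

```python
def block_sum(img, y, x, size):
    """Sum of the size x size block whose top-left corner is (y, x)."""
    return sum(img[y + i][x + j] for i in range(size) for j in range(size))


def block_sad(ref, cur, ry, rx, cy, cx, size):
    """Sum of absolute differences between a block in cur and a block in ref."""
    return sum(abs(cur[cy + i][cx + j] - ref[ry + i][rx + j])
               for i in range(size) for j in range(size))


def coarse_to_fine_search(ref, cur, cy, cx, size=2, search=2, keep=3):
    """Return (dy, dx, sad) for the block at (cy, cx) in cur.

    Stage 1 ranks every in-frame search position by a cheap coarse cost
    (difference of block sums); stage 2 computes the full SAD only for
    the `keep` best-ranked positions.
    """
    h, w = len(ref), len(ref[0])
    target = block_sum(cur, cy, cx, size)
    coarse = []
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = cy + dy, cx + dx
            if 0 <= ry and 0 <= rx and ry + size <= h and rx + size <= w:
                coarse.append((abs(block_sum(ref, ry, rx, size) - target), dy, dx))
    coarse.sort()
    best = None
    for _, dy, dx in coarse[:keep]:
        cost = block_sad(ref, cur, cy + dy, cx + dx, cy, cx, size)
        if best is None or cost < best[2]:
            best = (dy, dx, cost)
    return best
```

Because the absolute difference of block sums never exceeds the SAD, an exact match always has a coarse cost of zero and therefore survives the first stage.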
- Conference Article
1
- 10.1109/icosst48232.2019.9043975
- Dec 1, 2019
Object segmentation, detection and tracking in videos is one of the most important tasks in computer vision. It is necessary in all deployed real-time surveillance systems. Various unsupervised and semi-supervised video object segmentation techniques have been implemented and have shown effective results. However, all of these techniques process every frame of a video sequence, which requires a large amount of training data and results in long computation times. In this paper, a semi-supervised technique is proposed which segments an object in a video by processing just a single frame of the sequence. In this framework, a fully convolutional network is used to separate the foreground from the image, create the mask of the object, and then segment the object with the help of this mask. The foreground separation in a frame is done using a pre-trained network, while training and testing of the rest of the network are done using the DAVIS dataset. The results show that the proposed framework takes less computational time and improves the overall accuracy of video object segmentation by 10% compared to previous techniques.
- Research Article
15
- 10.1016/j.jvcir.2015.07.010
- Jul 15, 2015
- Journal of Visual Communication and Image Representation
GPU-Accelerated Video Background Subtraction Using Gabor Detector