Click Carving: Interactive Object Segmentation in Images and Videos with Point Clicks
We present a novel form of interactive object segmentation called Click Carving which enables accurate segmentation of objects in images and videos with only a few point clicks. Whereas conventional interactive pipelines take the user’s initialization as a starting point, we show the value in the system taking the lead even in initialization. In particular, for a given image or a video frame, the system precomputes a ranked list of thousands of possible segmentation hypotheses (also referred to as object region proposals) using appearance and motion cues. Then, the user looks at the top ranked proposals, and clicks on the object boundary to carve away erroneous ones. This process iterates (typically 2–3 times), and each time the system revises the top ranked proposal set, until the user is satisfied with a resulting segmentation mask. In the case of images, this mask is considered as the final object segmentation. In the case of videos, however, the object region proposals rely on motion as well, and the resulting segmentation mask in the first frame is further propagated across the video to obtain a complete spatio-temporal object tube. On six challenging image and video datasets, we provide extensive comparisons with both existing work and simpler alternative methods. In all, the proposed Click Carving approach strikes an excellent balance of accuracy and human effort. It outperforms all similarly fast methods, and is competitive with or better than those requiring 2–12 times the effort.
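The click-and-revise loop described above can be illustrated with a minimal sketch. The abstract does not say how clicks are scored against proposals, so the rule used here (count the clicks that fall within a small radius of a proposal's boundary) is an assumption, as are the function name `rerank_by_clicks`, the `radius` parameter, and the toy proposals.

```python
from math import hypot

def rerank_by_clicks(proposals, clicks, radius=2.0):
    """Re-rank proposals by how many clicks lie near each proposal's boundary.

    proposals: list of (name, boundary_points) in their initial rank order.
    clicks: list of (x, y) points the user clicked on the object boundary.
    The scoring rule is an illustrative assumption, not the paper's method.
    """
    def hits(boundary):
        # A click counts as a hit if any boundary point is within `radius`.
        return sum(
            any(hypot(cx - bx, cy - by) <= radius for bx, by in boundary)
            for cx, cy in clicks
        )

    # sorted() is stable, so proposals with equal hits keep their initial order.
    return sorted(proposals, key=lambda p: -hits(p[1]))

# Hypothetical proposals, each reduced to a handful of boundary points.
proposals = [
    ("A", [(0, 0), (0, 5), (5, 5), (5, 0)]),
    ("B", [(10, 10), (10, 20), (20, 20), (20, 10)]),
    ("C", [(10, 10), (10, 15), (15, 15), (15, 10)]),
]
ranked = rerank_by_clicks(proposals, clicks=[(10, 11), (20, 19)])
print([name for name, _ in ranked])
```

Proposals whose boundaries agree with more clicks move to the top, which is the sense in which each click carves away inconsistent hypotheses.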
- Research Article
49
- 10.1609/hcomp.v4i1.13288
- Sep 21, 2016
- Proceedings of the AAAI Conference on Human Computation and Crowdsourcing
We present a novel form of interactive video object segmentation where a few clicks by the user help the system produce a full spatio-temporal segmentation of the object of interest. Whereas conventional interactive pipelines take the user's initialization as a starting point, we show the value in the system taking the lead even in initialization. In particular, for a given video frame, the system precomputes a ranked list of thousands of possible segmentation hypotheses (also referred to as object region proposals) using image and motion cues. Then, the user looks at the top ranked proposals, and clicks on the object boundary to carve away erroneous ones. This process iterates (typically 2-3 times), and each time the system revises the top ranked proposal set, until the user is satisfied with a resulting segmentation mask. Finally, the mask is propagated across the video to produce a spatio-temporal object tube. On three challenging datasets, we provide extensive comparisons with both existing work and simpler alternative methods. In all, the proposed Click Carving approach strikes an excellent balance of accuracy and human effort. It outperforms all similarly fast methods, and is competitive with or better than those requiring 2 to 12 times the effort.
- Research Article
1
- 10.31676/0235-2591-2024-2-53-62
- May 7, 2024
- Horticulture and viticulture
This article reports the results of research studies conducted in 2023–2024 on transfer learning of Segmentation Convolutional Neural Networks (Seg-CNN) models for classification, recognition, and segmentation of branches with apple fruits and stems in images. State-of-the-art convolutional neural network architectures, i.e., YOLOv8(n,s,m,l,x)-seg, were used for a detailed segmentation of biological objects in images of varying complexity and scale at the pixel level. An image dataset collected in the field using a GoPro HERO 11 camera was annotated for transfer model training. Data augmentation was performed, producing a total of 2500 images. Images were annotated using the polygon annotation tool: polygonal contours were drawn around branches, apple fruits, and stems to mark the object segments in each image. The objects were assigned the following classes: Apple branch, Apple fruit, and Apple stem. Binary classification metrics, such as Precision and Recall, as well as Mean Average Precision (mAP), were used to evaluate the performance of the trained models in recognizing branches with apple fruits and stems in images. The YOLOv8x-seg (mAP50 0.758) and YOLOv8l-seg (mAP50 0.74) models showed high performance in terms of all metrics in recognizing branches, apple fruit, and fruit stems in images, outperforming the YOLOv8n-seg (mAP50 0.7) model due to their more complex architecture. The YOLOv8n-seg model has a faster frame processing speed (11.39 frames/s), rendering it a preferred choice for computing systems with limited resources. The results obtained confirm the prospects of using machine learning algorithms and convolutional neural networks for segmentation and pixel-by-pixel classification of branches with apple fruits and stems on RGB images for monitoring the condition of plants and determining their geometric characteristics.
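For reference, Precision and Recall as named above follow the standard definitions from true-positive, false-positive, and false-negative counts. The sketch below uses made-up counts, not results from the article, and the helper name `precision_recall` is hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Standard precision and recall from detection counts.

    tp: correct detections, fp: spurious detections, fn: missed objects.
    Returns 0.0 for a metric whose denominator is zero.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Illustrative counts only (not taken from the article).
p, r = precision_recall(tp=8, fp=2, fn=8)
print(p, r)
```

mAP50 builds on these quantities by averaging, over classes, the area under the precision-recall curve when a detection counts as correct at an IoU of at least 0.5.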
- Research Article
20
- 10.1109/tip.2018.2859622
- Jul 30, 2018
- IEEE Transactions on Image Processing
It is a challenging task to extract the segmentation mask of a target from a single noisy video, which involves object discovery coupled with segmentation. To solve this challenge, we present a method to jointly discover and segment an object from a noisy video, where the target disappears intermittently throughout the video. Previous methods either only fulfill video object discovery, or video object segmentation presuming the existence of the object in each frame. We argue that jointly conducting the two tasks in a unified way will be beneficial. In other words, video object discovery and video object segmentation tasks can facilitate each other. To validate this hypothesis, we propose a principled probabilistic model, where two dynamic Markov networks are coupled: one for discovery and the other for segmentation. When conducting the Bayesian inference on this model using belief propagation, the bi-directional message passing reveals a clear collaboration between these two inference tasks. We validated our proposed method on five data sets. The first three video data sets, i.e., the SegTrack data set, the YouTube-objects data set, and the Davis data set, are not noisy: all video frames contain the objects. The two noisy data sets, i.e., the XJTU-Stevens data set and the Noisy-ViDiSeg data set, newly introduced in this paper, both have many frames that do not contain the objects. Compared with the state of the art, our method produces inferior results on video data sets without noisy frames but better results on video data sets with noisy frames.
- Research Article
2
- 10.1080/02533839.2009.9671524
- Apr 1, 2009
- Journal of the Chinese Institute of Engineers
In this paper, we present work on interactive object segmentation in digital images. The user can employ the proposed work to separate the target object from the background easily. After the user draws foreground and background markers, the user interface shows the target object in its original color and the background in another specific color. Compared with the tedious steps of Adobe Photoshop, the user can interactively approximate the contour of the target object with more efficiency. Our proposed algorithm is based on Foreground/Background region classification by comparing the similarity of color information. At first, the user interface processes the input image by watershed segmentation to produce the segmented regions. Then, some unlabeled regions are assigned as foreground regions or background regions by marker drawing. After marker drawing, the remaining unlabeled regions are processed by Foreground/Background region classification. In our implementation, we also introduce hierarchical queues to store the unlabeled regions during the procedure of region classification. The target object is segmented after Foreground/Background region classification. In our experiments, the proposed algorithm provides output with high accuracy and low effort.
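The marker-then-classify procedure above can be sketched as seeded label propagation over a region adjacency graph. This is only an illustration under assumptions: regions are reduced to a single mean gray value, color similarity is the absolute difference of those values, and Python's `heapq` stands in for the hierarchical queues, whose exact design the abstract does not describe. The names `classify_regions`, `colors`, `adjacency`, and `seeds` are hypothetical.

```python
import heapq

def classify_regions(colors, adjacency, seeds):
    """Propagate foreground/background labels from marked regions.

    colors: region -> mean gray value (stand-in for color information).
    adjacency: region -> list of neighbouring regions.
    seeds: region -> label for regions covered by user markers.
    Unlabeled regions are visited in order of color distance to an
    already labeled neighbour, most similar first.
    """
    labels = dict(seeds)
    heap = []

    def push_neighbors(region):
        for nb in adjacency[region]:
            if nb not in labels:
                dist = abs(colors[region] - colors[nb])
                heapq.heappush(heap, (dist, nb, labels[region]))

    for region in seeds:
        push_neighbors(region)

    while heap:
        _, region, label = heapq.heappop(heap)
        if region in labels:
            continue  # already claimed by a more similar neighbour
        labels[region] = label
        push_neighbors(region)
    return labels

# Four regions in a chain; r0 is marked foreground and r3 background.
colors = {"r0": 200, "r1": 190, "r2": 60, "r3": 50}
adjacency = {"r0": ["r1"], "r1": ["r0", "r2"], "r2": ["r1", "r3"], "r3": ["r2"]}
seeds = {"r0": "fg", "r3": "bg"}
labels = classify_regions(colors, adjacency, seeds)
print(labels)
```

In this toy chain the bright region r1 joins the foreground and the dark region r2 joins the background, because each is claimed by its most similar labeled neighbour first.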
- Conference Article
- 10.1109/icip.2005.1529672
- Jan 1, 2005
Automatic segmentation of objects in images is an ongoing research problem with applications in many fields. If a scene is imaged serially over time, an advantage can be gained by using segmentation results from previous and subsequent images when segmenting the current image. This paper discusses a probabilistic framework for making use of temporal information in the segmentation process. The hidden Markov model, a subset of dynamic Bayesian networks, is described as a means to improve segmentation over statistical classification techniques that use static pixel intensity information alone. An application of this technique to the segmentation of tumors in magnetic resonance images (MRIs) is described. The segmentation accuracy was increased compared to a popular 3D spatial-only segmentation method.
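To illustrate how a hidden Markov model can use temporal context, the sketch below decodes the most likely label sequence for one pixel across serial images with the Viterbi algorithm. The abstract does not give the paper's model details, so everything here is an assumption: two states, a uniform prior, a single `stay` probability for transitions, and made-up per-frame likelihoods.

```python
from math import log

def viterbi(likelihoods, stay=0.8):
    """Most likely state sequence for one pixel over time.

    likelihoods: one tuple per frame with P(observation | state) per state.
    stay: probability of keeping the same state between consecutive frames;
    the remaining mass is split evenly over the other states.
    """
    n_states = len(likelihoods[0])
    switch = (1.0 - stay) / (n_states - 1)

    def trans(i, j):
        return log(stay if i == j else switch)

    # Uniform prior over states for the first frame.
    score = [log(1.0 / n_states) + log(p) for p in likelihoods[0]]
    back = []
    for frame in likelihoods[1:]:
        prev = score
        pointers, score = [], []
        for j, p in enumerate(frame):
            best = max(range(n_states), key=lambda i: prev[i] + trans(i, j))
            pointers.append(best)
            score.append(prev[best] + trans(best, j) + log(p))
        back.append(pointers)

    # Backtrack from the best final state.
    state = max(range(n_states), key=lambda i: score[i])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# State 0 = normal tissue, state 1 = tumor; frame 3 is a noisy observation.
frames = [(0.9, 0.1), (0.8, 0.2), (0.4, 0.6), (0.9, 0.1)]
static = [max(range(2), key=lambda s: f[s]) for f in frames]
temporal = viterbi(frames)
print(static, temporal)
```

Classifying each frame on its own flips the label at the noisy third frame, while the temporal model keeps the pixel in the same class throughout.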
- Research Article
3
- 10.1016/j.egypro.2011.11.284
- Jan 1, 2011
- Energy Procedia
Novel Image Segmentation Using Gaussian Mixture Models: Application to Plant Phenotypic Analysis
- Conference Article
5
- 10.1145/1186415.1186483
- Jan 1, 2004
Object segmentation in image sequences is one of the fundamental problems in computer vision and graphics. This problem is usually addressed either by discrete representations which are currently manifested by graph partitioning techniques, or by continuous methods typically referred to as active contours. In this work we take a unified approach by fitting splines to graph cuts. The strengths of this approach stem from the dual discrete and continuous representations and from allowing the user to refine the result of the cut by fitting a new spline to it and modifying its points. Segmentation of an object in video is performed by a series of updates to the control points and computation of a minimum graph cut. Usually the graph cut results in a discrete representation over which the user has no control, and which is not always our desired result. Therefore our approach is to fit a spline to the resulting cut in key-frames. This allows the user to change the control points of the spline and then perform additional iterations of cut computation.
- Research Article
1
- 10.1016/j.softx.2024.101979
- Nov 23, 2024
- SoftwareX
Object detection and tracking are crucial components in the development of various applications and research endeavors within the computer science and robotics community. However, the diverse shapes and appearances of real-world objects, as well as the dynamic nature of scenes, can pose significant challenges for these tasks. Existing object detection and tracking methods often require extensive data annotation and model re-training when applied to new objects or environments, diverting valuable time and resources from the primary research objectives. In this paper, we present IST-ROS, Interactive Segmentation and Tracking for ROS, a software solution that leverages the capabilities of the Segment Anything Model (SAM) and semi-supervised video object segmentation methods to enable flexible and efficient object segmentation and tracking. Its graphical interface allows interactive object selection and segmentation using various prompts, while integrated tracking ensures robust performance even under occlusions and object interactions. By providing a flexible solution for object segmentation and tracking, IST-ROS aims to facilitate rapid prototyping and advancement of robotics applications.
- Dissertation
- 10.47749/t/unicamp.2015.961083
- Sep 28, 2015
Interactive segmentation of objects in images and videos using graphs and fuzzy models of content knowledge
- Book Chapter
2
- 10.1007/978-3-319-70742-6_12
- Jan 1, 2017
This paper investigates how to exploit human feedback for interactive object segmentation in videos. In particular, we present an interactive video object segmentation approach where humans can contribute either explicitly, by clicking on objects of interest in videos, or implicitly, while looking at video sequences. User feedback is then translated into a set of spatio-temporal constraints for an energy-based minimization problem. We tested the method on standard benchmarking datasets using both eye-gaze data and user clicks. The results indicated that our method outperformed existing automated and interactive methods regardless of the type of human feedback (explicit or implicit), and that click-based feedback was more reliable than eye-gaze feedback.
- Research Article
13
- 10.7763/ijiee.2011.v1.12
- Jan 1, 2011
- International Journal of Information and Electronics Engineering
A novel algorithm is proposed for background estimation using machine learning and statistical pattern recognition. Usually the segmentation of objects in images is achieved by identifying homogeneous regions in individual images or by finding motions of objects in videos. In this paper, we combine the advantages of these approaches for the estimation of background using only two images. The proposed algorithm uses the difference between the images to obtain an initial estimate of the background and then refines the estimate using machine learning and statistical pattern recognition. Experimental results have shown that the proposed algorithm can achieve promising performance in terms of accuracy and speed.
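The first step described above, an initial background estimate from the difference between two images, can be sketched as follows. The thresholding rule, the `threshold` value, and the use of `None` for undecided pixels are assumptions made for illustration; the learning-based refinement step is not specified in the abstract and is not shown.

```python
def initial_background(img_a, img_b, threshold=10):
    """Initial background estimate from two grayscale images.

    Pixels whose values differ by at most `threshold` are treated as
    static background and keep the value from img_a. Pixels that differ
    more are left as None, to be filled in by a later refinement step.
    """
    return [
        [a if abs(a - b) <= threshold else None for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]

# Toy 2x3 images: a bright object moves from the top right to the bottom centre.
img_a = [[10, 10, 200], [10, 10, 10]]
img_b = [[12, 10, 10], [10, 200, 10]]
background = initial_background(img_a, img_b)
print(background)
```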
- Conference Article
10
- 10.1109/ccece.1999.808047
- May 9, 1999
Many objects in images of natural scenes are so complex that describing them by traditional techniques is inadequate. This paper presents a family of techniques suitable for texture analysis and segmentation of objects in aerial images. Texture has been one of the most important but difficult properties for image coding and compression. It is important because it describes the entire area of a region and provides the essential structure information in regions of an image. Our goal here is to decompose an image into texturally homogeneous regions. An efficient technique for computing the fractal dimension of images is used. Three different techniques (the Hurst transform, the Sobel operator, and the variance) are applied to two images and the results are compared. It is shown that the variance dimension converts the original image to one whose texture information permits simple thresholding for texture analysis and segmentation.
- Research Article
- 10.33633/jais.v8i3.9024
- Nov 30, 2023
- Journal of Applied Intelligent System
Segment Anything Model (SAM) is a model capable of performing object segmentation in images without requiring any additional training. Although the segmentation produced by SAM lacks high precision, this model holds interesting potential for more accurate segmentation tasks. In this study, we propose a Post-Processing method called Conditional Matting 4 (CM4) to enhance high-precision object segmentation, including prominent, occluded, and complex boundary objects in the segmentation results from SAM. The proposed CM4 Post-Processing method incorporates the use of morphological operations, DistilBERT, InSPyReNet, Grounding DINO, and ViTMatte. We combine these methods to improve the object segmentation produced by SAM. Evaluation is conducted using metrics such as IoU, SAD, MAD, Grad, and Conn. The results of this study show that, on the AIM-500 dataset, the proposed CM4 Post-Processing method successfully improves object segmentation, with a SAD evaluation score of 20.42 (a 27% improvement from the previous study) and an MSE evaluation score of 21.64 (a 45% improvement from the previous study). The significant improvement in evaluation scores demonstrates the enhanced capability of CM4 in achieving high precision and overcoming the limitations of the initial segmentation produced by SAM. The contribution of this research lies in the development of an effective CM4 Post-Processing method for enhancing object segmentation in images with high precision. This method holds potential for various computer vision applications that require accurate and detailed object segmentation.
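Two of the metrics named above have simple standard definitions, sketched here on toy data. IoU compares binary masks, and SAD sums the absolute differences between a predicted and a reference alpha matte. Matting benchmarks often report SAD divided by 1000, which this sketch does not do, and the inputs are made up rather than taken from the study.

```python
def iou(pred, truth):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    pairs = [(p, t) for pr, tr in zip(pred, truth) for p, t in zip(pr, tr)]
    inter = sum(1 for p, t in pairs if p and t)
    union = sum(1 for p, t in pairs if p or t)
    return inter / union if union else 1.0

def sad(alpha_pred, alpha_true):
    """Sum of absolute differences between two alpha mattes in [0, 1]."""
    return sum(
        abs(p - t)
        for pr, tr in zip(alpha_pred, alpha_true)
        for p, t in zip(pr, tr)
    )

# Toy masks and mattes for illustration only.
mask_iou = iou([[1, 1, 0], [0, 1, 0]], [[1, 0, 0], [0, 1, 1]])
matte_sad = sad([[1.0, 0.5], [0.0, 0.25]], [[1.0, 0.0], [0.0, 0.75]])
print(mask_iou, matte_sad)
```

Higher IoU is better, while lower SAD is better, so an improvement in SAD means the score went down.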
- Research Article
10
- 10.1016/j.eswa.2019.05.019
- May 15, 2019
- Expert Systems with Applications
Automatic trimap generation and artifact reduction in alpha matte using unknown region detection
- Conference Article
42
- 10.1145/500141.500150
- Oct 1, 2001
The segmentation of objects in video sequences constitutes a prerequisite for numerous applications ranging from computer vision tasks to second-generation video coding. We propose an approach for segmenting video objects based on motion cues. To estimate motion we employ the 3D structure tensor, an operator that provides reliable results by integrating information from a number of consecutive video frames. We present a new hierarchical algorithm, embedding the structure tensor into a multiresolution framework to allow the estimation of large velocities. The motion estimates are included as an external force into a geodesic active contour model, thus stopping the evolving curve at the moving object's boundary. A level set-based implementation allows the simultaneous segmentation of several objects. As an application based on our object segmentation approach we provide a video object classification system. Curvature features of the object contour are matched by means of a curvature scale space technique to a database containing preprocessed views of prototypical objects. We provide encouraging experimental results calculated on synthetic and real-world video sequences to demonstrate the performance of our algorithms.