Deep Snake for Real-Time Instance Segmentation
This paper introduces a novel contour-based approach named deep snake for real-time instance segmentation. Unlike some recent methods that directly regress the coordinates of the object boundary points from an image, deep snake uses a neural network to iteratively deform an initial contour to match the object boundary, which implements the classic idea of snake algorithms with a learning-based approach. For structured feature learning on the contour, we propose to use circular convolution in deep snake, which better exploits the cycle-graph structure of a contour compared against generic graph convolution. Based on deep snake, we develop a two-stage pipeline for instance segmentation: initial contour proposal and contour deformation, which can handle errors in object localization. Experiments show that the proposed approach achieves competitive performances on the Cityscapes, KINS, SBD and COCO datasets while being efficient for real-time applications with a speed of 32.3 fps for 512 x 512 images on a 1080Ti GPU. The code is available at https://github.com/zju3dv/snake/.
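The circular convolution mentioned in the abstract above can be illustrated with a minimal sketch. It assumes one scalar feature per contour vertex and a hand-written kernel; the function name `circular_conv1d` is hypothetical and this is not the authors' implementation, which operates on learned multi-channel features.

```python
def circular_conv1d(features, kernel):
    """Convolve per-vertex features along a closed contour.

    Indices wrap around, so the first and last vertices are treated
    as neighbours, matching the cycle-graph structure of a contour.
    """
    n = len(features)
    half = len(kernel) // 2
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            # (i + k - half) % n wraps past either end of the list.
            acc += w * features[(i + k - half) % n]
        out.append(acc)
    return out


# Sum each vertex with its two neighbours on a 4-vertex contour.
summed = circular_conv1d([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0])
```

Because of the wrap-around, every vertex sees a full neighbourhood regardless of where the vertex list happens to start.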
- Conference Article
109
- 10.1109/cvpr52688.2022.00440
- Jun 1, 2022
Contour-based instance segmentation methods have developed rapidly recently but feature rough and hand-crafted front-end contour initialization, which restricts the model performance, and an empirical and fixed backend predicted-label vertex pairing, which contributes to the learning difficulty. In this paper, we introduce a novel contour-based method, named E2EC, for high-quality instance segmentation. Firstly, E2EC applies a novel learnable contour initialization architecture instead of hand-crafted contour initialization. This consists of a contour initialization module for constructing more explicit learning goals and a global contour deformation module for taking advantage of all of the vertices' features better. Secondly, we propose a novel label sampling scheme, named multi-direction alignment, to reduce the learning difficulty. Thirdly, to improve the quality of the boundary details, we dynamically match the most appropriate predicted-ground truth vertex pairs and propose the corresponding loss function named dynamic matching loss. The experiments showed that E2EC can achieve a state-of-the-art performance on the KITTI INStance (KINS) dataset, the Semantic Boundaries Dataset (SBD), the Cityscapes and the COCO dataset. E2EC is also efficient for use in real-time applications, with an inference speed of 36 fps for $512\times 512$ images on an NVIDIA A6000 GPU. Code will be released at https://github.com/zhang-tao-whu/e2ec.
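The abstract does not define the dynamic matching loss, so the following is only a rough sketch of the general idea of pairing vertices by proximity instead of by fixed index. It assumes a symmetric nearest-neighbour L1 distance between two point lists; the function name `nearest_match_loss` and the exact formula are assumptions, not the E2EC loss.

```python
def nearest_match_loss(pred, gt):
    """Symmetric nearest-neighbour L1 distance between two contours.

    Each vertex is paired with the closest vertex of the other contour,
    so the pairing is chosen per sample rather than fixed in advance.
    """
    def one_way(src, dst):
        total = 0.0
        for (x, y) in src:
            total += min(abs(x - u) + abs(y - v) for (u, v) in dst)
        return total / len(src)

    return 0.5 * (one_way(pred, gt) + one_way(gt, pred))


loss = nearest_match_loss([(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (1.0, 1.0)])
```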
- Research Article
8
- 10.1016/j.patrec.2022.05.025
- Jul 1, 2022
- Pattern Recognition Letters
Contour deformation network for instance segmentation
- Research Article
18
- 10.1109/access.2023.3256723
- Jan 1, 2023
- IEEE Access
Fully-supervised object detection and instance segmentation models have accomplished notable results on large-scale computer vision benchmark datasets. However, fully-supervised machine learning algorithms’ performances are immensely dependent on the quality of the training data. Preparing computer vision datasets for object detection and instance segmentation is a labor-intensive task requiring each instance in an image to be annotated. In practice, this often results in the quality of bounding box and polygon mask annotations being suboptimal. This paper quantifies empirically the ground truth annotation quality and COCO’s mean average precision (mAP) performance by introducing two separate noise measures, uniform and radial, into the ground truth bounding box and polygon mask annotations for the COCO and Cityscapes datasets. Mask-RCNN models are trained on various levels of noise measures to investigate the performance of each level of noise. The results showed degradation of mAP as the level of both noise measures increased. For object detection and instance segmentation respectively, using the highest level of noise measure resulted in a mAP degradation of 0.185 & 0.208 for uniform noise with reductions of 0.118 & 0.064 for radial noise on the COCO dataset. As for the Cityscapes datasets, reductions of mAP performance of 0.147 & 0.142 for uniform noise and 0.101 & 0.033 for radial noise were recorded. Furthermore, a decrease in average precision is seen across all classes, with the exception of the class motorcycle. The reductions between classes vary, indicating the effects of annotation uncertainty are class-dependent.
- Research Article
1
- 10.14569/ijacsa.2023.0141058
- Jan 1, 2023
- International Journal of Advanced Computer Science and Applications
To address the problems of missed detection, segmentation error and poor target edge segmentation in the instance segmentation model, an R2SC-Yolact++ instance segmentation approach based on an improved Yolact++ is proposed. Firstly, the backbone network adopts Res2Net with a spatial attention mechanism (SAM) to reduce segmentation errors by better extracting feature information; then, high-quality masks are obtained by fusing the detail information of the shallow feature P2 as the input to the prototype mask branch; finally, the problem of missed detection is addressed by introducing Cluster-NMS to improve the accuracy of the detection boxes. To illustrate the effectiveness of the improved model, experiments were conducted on two publicly available datasets, COCO and CVPPP. The experimental results show that the accuracy on the COCO dataset is 1.1% higher than that of the original model, and the accuracy on the CVPPP dataset is 1.7% higher than before the improvement, which is better than other mainstream instance segmentation algorithms such as Mask RCNN. Finally, the improved model is applied to the insulator dataset, where it can segment the insulator sheds accurately.
- Research Article
37
- 10.1109/tcsvt.2021.3063377
- Mar 5, 2021
- IEEE Transactions on Circuits and Systems for Video Technology
Instance segmentation needs to locate all instances in an image correctly and segment each instance precisely. Currently, the most dominant methods for instance segmentation take object detection as a pre-task. However, they rely heavily on the accuracy of object detection. If the pre-task cannot predict an accurate bounding box, the performance of instance segmentation will degrade. In this paper, we present a novel method for instance segmentation to solve this problem, which is called Segmenting Beyond the Bounding Box (S3B-Net). Our S3B-Net designs a sub-network to help instance segmentation methods based on object detection to segment the part of an instance beyond the bounding box. Specifically, the sub-network first predicts a two-dimensional pixel embedding for each pixel. Then, a Gaussian function is employed to calculate the probability that a pixel belongs to the corresponding instance according to the two-dimensional pixel embedding. Finally, the output of the sub-network is combined with the output of instance segmentation based on object detection to generate a more precise instance mask. Our sub-network can easily be added to existing instance segmentation methods based on object detection to segment instances beyond the bounding box. We conduct our experiments on dominant instance segmentation datasets, such as the COCO dataset and the Cityscapes dataset. The results show that our method achieves a gain of 6.8 points over the baseline Mask R-CNN with ResNet-50-FPN on the Cityscapes dataset, and a gain of 1.7 points with ResNet-101-FPN-DCN on the COCO dataset. Our S3B-Net outperforms the previous state-of-the-art instance segmentation method, which shows that our method is competitive. The source code of our method will be made available.
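The Gaussian step described above can be sketched as follows. The abstract does not give the exact formula, so this assumes an isotropic Gaussian centred on a per-instance embedding; the names `gaussian_membership`, `center` and `sigma` are hypothetical.

```python
import math


def gaussian_membership(embedding, center, sigma):
    """Probability-like score that a pixel belongs to an instance.

    `embedding` is the pixel's 2-D embedding, `center` the instance's
    embedding centre, and `sigma` an assumed bandwidth parameter.
    """
    dx = embedding[0] - center[0]
    dy = embedding[1] - center[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma ** 2))


# A pixel whose embedding coincides with the centre gets the maximum score.
at_center = gaussian_membership((0.3, 0.7), (0.3, 0.7), sigma=0.5)
```

Under this assumption the score decays smoothly with embedding distance and does not depend on whether the pixel lies inside the detected bounding box.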
- Book Chapter
256
- 10.1007/978-3-030-58568-6_39
- Jan 1, 2020
Tremendous efforts have been made to improve mask localization accuracy in instance segmentation. Modern instance segmentation methods relying on fully convolutional networks perform pixel-wise classification, which ignores object boundaries and shapes, leading to coarse and indistinct mask prediction results and imprecise localization. To remedy these problems, we propose a conceptually simple yet effective Boundary-preserving Mask R-CNN (BMask R-CNN) to leverage object boundary information to improve mask localization accuracy. BMask R-CNN contains a boundary-preserving mask head in which object boundary and mask are mutually learned via feature fusion blocks. As a result, the predicted masks are better aligned with object boundaries. Without bells and whistles, BMask R-CNN outperforms Mask R-CNN by a considerable margin on the COCO dataset; in the Cityscapes dataset, there are more accurate boundary groundtruths available, so that BMask R-CNN obtains remarkable improvements over Mask R-CNN. Besides, it is not surprising to observe that BMask R-CNN obtains more obvious improvement when the evaluation criterion requires better localization (e.g., AP$_{75}$) as shown in Fig. 1. Code and models are available at https://github.com/hustvl/BMaskR-CNN .
- Research Article
- 10.1038/s41598-025-18845-7
- Sep 29, 2025
- Scientific Reports
In autonomous driving, instance segmentation is crucial for detecting and segmenting pedestrians and vehicles in road scenes. However, autonomous driving technology faces complex and diverse scene information, and the existing contour-based methods have imprecise initial contours with significant errors. The subsequent contour deformation modules cannot correct errors from the previous iterations, increasing the learning difficulty. Therefore, this research proposes two novel methods: Contour Initialization based on Instance Center Features (CIICF) and Contour Deformation based on Differentiation Module (CDDM). CIICF leverages instance center features to predict distances between instance contours and centers, thereby enhancing the accuracy of the initial contour representation. CDDM substitutes circular convolution with zero-padded one-dimensional convolution, which allows contour points to inherently learn their positions relative to the entire contour. Additionally, we incorporate absolute position encoding into the feature map to improve the model's positional awareness. The superiority of our method was validated on public datasets such as Cityscapes, KINS, and SBD. Compared to the baseline CIICF-Deep Snake model, the final contour AP, AP$_{50}$, and AP$_{70}$ increased by 6.9%, 5.1% and 9.1%, respectively. Moreover, the final contour generation speed increased from 43.2 Frames Per Second (FPS) to 49.8 FPS.
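The difference between circular and zero-padded one-dimensional convolution can be seen in a small sketch. It assumes scalar per-vertex features and a fixed kernel; the helper `conv1d` is hypothetical and is not the CDDM module itself.

```python
def conv1d(features, kernel, circular):
    """1-D convolution over contour vertices with either padding scheme."""
    n = len(features)
    half = len(kernel) // 2
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if circular:
                acc += w * features[j % n]   # wrap around the contour
            elif 0 <= j < n:
                acc += w * features[j]       # out-of-range taps contribute zero
        out.append(acc)
    return out


flat = [1.0, 1.0, 1.0, 1.0]
kernel = [1.0, 1.0, 1.0]
circular_out = conv1d(flat, kernel, circular=True)
padded_out = conv1d(flat, kernel, circular=False)
```

With identical input features, the circular variant returns the same value at every vertex, while the zero-padded variant returns smaller values at the two ends of the vertex list. That asymmetry is one way a zero-padded layer can carry information about a point's position along the contour.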
- Research Article
1
- 10.3390/s22176499
- Aug 29, 2022
- Sensors (Basel, Switzerland)
Instance segmentation has been developing rapidly in recent years. Mask R-CNN, a two-stage instance segmentation approach, has demonstrated exceptional performance. However, the masks are still very coarse. The downsampling operation of the backbone network and the ROIAlign layer loses much detailed information, especially from large targets. The low resolution causes a sawtooth effect along mask edges, and the small proportion of boundary pixels leads to coarse segmentation. In this paper, we propose a new method called Boundary Refine (BRefine) that achieves high-quality segmentation. This approach uses FCN as the foundation segmentation architecture, and forms a multistage fusion mask head with multistage fusion detail features to improve mask resolution. However, the FCN architecture causes inconsistencies in multiscale segmentation. BRank and sort loss (BR and S loss) is proposed to solve the problems of segmentation inconsistency and the difficulty of boundary segmentation. It combines rank and sort loss with boundary region loss. BRefine can handle hard-to-partition boundaries and output high-quality masks. On the COCO, LVIS, and Cityscapes datasets, BRefine outperformed Mask R-CNN by 3.0, 4.2, and 3.5 AP, respectively. Furthermore, on the COCO dataset, the large objects improved by 5.0 AP.
- Book Chapter
3
- 10.1007/978-3-031-21014-3_31
- Jan 1, 2022
Circle representation has recently been introduced as a "medical imaging optimized" representation for more effective instance object detection on ball-shaped medical objects. With its superior performance on instance detection, it is appealing to extend the circle representation to instance medical object segmentation. In this work, we propose CircleSnake, a simple end-to-end circle contour deformation-based segmentation method for ball-shaped medical objects. Compared to the prevalent DeepSnake method, our contribution is threefold: (1) We replace the complicated bounding box to octagon contour transformation with a computation-free and consistent bounding circle to circle contour adaption for segmenting ball-shaped medical objects; (2) Circle representation has fewer degrees of freedom (DoF = 2) as compared with the octagon representation (DoF = 8), thus yielding a more robust segmentation performance and better rotation consistency; (3) To the best of our knowledge, the proposed CircleSnake method is the first end-to-end circle representation deep segmentation pipeline method with consistent circle detection, circle contour proposal, and circular convolution. The key innovation is to integrate the circular graph convolution with circle detection into an end-to-end instance segmentation framework, enabled by the proposed simple and consistent circle contour representation. Glomeruli are used to evaluate the performance of the benchmarks. From the results, CircleSnake increases the average precision of glomerular detection from 0.559 to 0.614. The Dice score increased from 0.804 to 0.849. The code has been released.
Keywords: Instance segmentation, Graph convolution, Pathology, Snake
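The bounding circle to circle contour step can be sketched as uniform sampling on the detected circle. The function name `circle_contour` and the default of 128 vertices are assumptions for illustration; the abstract does not state how many points are used.

```python
import math


def circle_contour(cx, cy, r, num_points=128):
    """Sample an initial contour uniformly on a detected circle.

    (cx, cy) is the circle centre and r its radius; `num_points`
    is an assumed vertex count.
    """
    step = 2.0 * math.pi / num_points
    return [
        (cx + r * math.cos(k * step), cy + r * math.sin(k * step))
        for k in range(num_points)
    ]


contour = circle_contour(50.0, 40.0, 10.0)
```

The resulting vertex list is already ordered around the object, so it can be fed directly to a contour deformation stage.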
- Research Article
7
- 10.1111/exsy.13504
- Nov 14, 2023
- Expert Systems
The edges of objects are of great significance to the task of instance segmentation. However, most of the current popular deep neural networks do not pay much attention to the object edge information. More importantly, using the down‐sampling pooling layer in the deep learning network, the edge detail information of the object will be lost. To address this issue, inspired by the manual annotation process, we propose Mask Point R‐CNN aiming at promoting the neural network's attention to the object boundary. Specifically, we introduce the auxiliary task of object contour point detection on the Mask R‐CNN framework, which can effectively improve the gradient flow between different tasks by multi‐task learning and repairing objects' boundary information via feature fusion. Consequently, the model can be more sensitive to the edges of the object and capture more geometric features. Quantitatively, the experimental results show that our Mask Point R‐CNN outperforms vanilla Mask R‐CNN by 3.8% on the Cityscapes dataset and 0.8% on the COCO dataset.
- Research Article
5
- 10.1609/aaai.v38i7.28555
- Mar 24, 2024
- Proceedings of the AAAI Conference on Artificial Intelligence
Video instance segmentation on mobile devices is an important yet very challenging edge AI problem. It mainly suffers from (1) heavy computation and memory costs for frame-by-frame pixel-level instance perception and (2) complicated heuristics for tracking objects. To address these issues, we present MobileInst, a lightweight and mobile-friendly framework for video instance segmentation on mobile devices. Firstly, MobileInst adopts a mobile vision transformer to extract multi-level semantic features and presents an efficient query-based dual-transformer instance decoder for mask kernels and a semantic-enhanced mask decoder to generate instance segmentation per frame. Secondly, MobileInst exploits simple yet effective kernel reuse and kernel association to track objects for video instance segmentation. Further, we propose temporal query passing to enhance the tracking ability for kernels. We conduct experiments on COCO and YouTube-VIS datasets to demonstrate the superiority of MobileInst and evaluate the inference latency on one single CPU core of the Snapdragon 778G Mobile Platform, without other methods of acceleration. On the COCO dataset, MobileInst achieves 31.2 mask AP and 433 ms on the mobile CPU, which reduces the latency by 50% compared to the previous SOTA. For video instance segmentation, MobileInst achieves 35.0 AP and 30.1 AP on YouTube-VIS 2019 & 2021.
- Book Chapter
743
- 10.1007/978-3-030-58452-8_17
- Jan 1, 2020
We propose a simple yet effective instance segmentation framework, termed CondInst (conditional convolutions for instance segmentation). Top-performing instance segmentation methods such as Mask R-CNN rely on ROI operations (typically ROIPool or ROIAlign) to obtain the final instance masks. In contrast, we propose to solve instance segmentation from a new perspective. Instead of using instance-wise ROIs as inputs to a network of fixed weights, we employ dynamic instance-aware networks, conditioned on instances. CondInst enjoys two advantages: (1) Instance segmentation is solved by a fully convolutional network, eliminating the need for ROI cropping and feature alignment. (2) Due to the much improved capacity of dynamically-generated conditional convolutions, the mask head can be very compact (e.g., 3 conv. layers, each having only 8 channels), leading to significantly faster inference. We demonstrate a simpler instance segmentation method that can achieve improved performance in both accuracy and inference speed. On the COCO dataset, we outperform a few recent methods including well-tuned Mask R-CNN baselines, without longer training schedules needed. Code is available: https://git.io/AdelaiDet .
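To make the compactness of such a mask head concrete, the following sketch counts the parameters that would have to be generated per instance. It assumes 1x1 convolutions with biases, an 8-channel input, and a single-channel mask output; the abstract only states three layers with 8 channels, so the rest of the layout and the name `dynamic_head_param_count` are assumptions.

```python
def dynamic_head_param_count(channels):
    """Count weights and biases for a stack of 1x1 convolution layers.

    `channels` lists the width before each layer followed by the
    final output width, e.g. [8, 8, 8, 1] for three layers.
    """
    return sum(c_in * c_out + c_out for c_in, c_out in zip(channels, channels[1:]))


# Assumed layout: 8-channel input, two 8-channel hidden layers, 1-channel mask.
n_params = dynamic_head_param_count([8, 8, 8, 1])
```

Under these assumptions the head needs 153 generated parameters per instance, which illustrates why predicting the filters dynamically remains cheap.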
- Conference Article
6
- 10.13031/aim.202100174
- Jan 1, 2021
- 2021 ASABE Annual International Virtual Meeting, July 12-16, 2021
Abstract. Seed phenotyping is the idea of analyzing the morphometric characteristics of a seed to predict the behavior of the seed in terms of development, tolerance and yield in various environmental conditions. Seed phenotyping requires estimating the morphometry of seeds, which can be a complex task considering that seeds, even of a certain variety, are not uniform. As a result, the manual estimation of seed morphometry requires a very large amount of manpower. In recent times, applications, both mobile and desktop, that estimate seed morphometry from images have become available. While the applications alleviate the problem to a degree, one key problem is the segmentation of clustered seeds on images. It is often the case that the seeds on the image are in contact with each other, which makes it hard to distinguish one seed from another. This phenomenon inevitably leads to erroneous estimates of seed morphometry. Recent developments in the field of machine learning have led to the development of neural networks that perform object detection and instance segmentation. The focus of the work is the application and feasibility analysis of the state-of-the-art object detection and localization neural networks, Mask R-CNN and YOLO (You Only Look Once), for seed phenotyping using TensorFlow. One of the major bottlenecks of such an endeavor is the need for a large amount of training data. While the capture of a multitude of seed images is daunting, the images are also required to be annotated to indicate the boundaries of the seeds on the image and converted to data formats that the neural networks are able to consume. Although tools that manually perform the task of annotation are available for free, the amount of time required is enormous. In order to tackle such a scenario, the idea of domain randomization i.e. 
the technique of applying models trained on images containing simulated objects to real-world objects, is considered. Besides, transfer learning i.e. the idea of applying the knowledge obtained while solving a problem to a different problem, is used. The networks are trained on pre-trained weights from the popular ImageNet and COCO data sets. Five types of seeds i.e. canola, rough rice, sorghum, soy and wheat, are experimented with, as part of the work. The performance of the technique is evaluated using average precision and recall. In order to apply domain randomization, a sample of 40 seeds of each type is considered. Images of each of the seeds are captured and then laid on a uniform background in different orientations and sizes. In essence, this procedure creates a multitude of training images that are used to train the neural networks. This technique ensures that the user does not need to possess numerous seeds to train the neural networks. Also, a plethora of training images can be generated on-demand with a desired number of seeds on each image. The ability to scale and orient the seed instances on the images means that the neural networks can be trained to be scale-invariant, addressing a common problem in image processing. Upon the segmentation of seeds, a technique that closely follows the guidelines laid out by the International Seed Morphology Association is proposed to estimate the morphometry of the seeds using the neural networks. Briefly, a standard US government-issued coin, the penny, is used. Since the dimensions of the penny are known in advance, a relationship between the coin morphometry in pixels and metric units is established. This relationship is later leveraged to perform a simple cross-multiplication to yield the morphometry of the seeds in question.
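The cross-multiplication with the reference coin can be sketched as below. It assumes the penny's diameter has been measured in pixels in the same image as the seed and uses the US cent's nominal diameter of 19.05 mm; the function name `pixels_to_mm` is hypothetical and the authors' exact procedure may differ.

```python
PENNY_DIAMETER_MM = 19.05  # nominal diameter of a US one-cent coin


def pixels_to_mm(length_px, penny_diameter_px, penny_diameter_mm=PENNY_DIAMETER_MM):
    """Convert a length in pixels to millimetres using the coin as a scale.

    length_mm / length_px == penny_diameter_mm / penny_diameter_px
    """
    if penny_diameter_px <= 0:
        raise ValueError("penny_diameter_px must be positive")
    return length_px * penny_diameter_mm / penny_diameter_px


# A seed spanning 100 px in an image where the penny spans 200 px.
seed_length_mm = pixels_to_mm(100.0, 200.0)
```

The conversion holds only when the seed and the coin lie in the same plane at the same distance from the camera.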
- Conference Article
76
- 10.1109/wacv48630.2021.00039
- Jan 1, 2021
Contour-based instance segmentation methods are attractive due to their efficiency. However, existing contour-based methods either suffer from lossy representation, complex pipeline or difficulty in model training, resulting in sub-par mask accuracy on challenging datasets like MS-COCO. In this work, we propose a novel deep attentive contour model, named DANCE, to achieve better instance segmentation accuracy while retaining good efficiency. To this end, DANCE applies two new designs: attentive contour deformation to refine the quality of segmentation contours and segment-wise matching to ease the model training. Comprehensive experiments demonstrate that DANCE excels at deforming the initial contour in a more natural and efficient way towards the real object boundaries. The effectiveness of DANCE is also validated on the COCO dataset, where it achieves 38.1% mAP and outperforms all other contour-based instance segmentation models. To the best of our knowledge, DANCE is the first contour-based model that achieves comparable performance to pixel-wise segmentation models. Code is available at https://github.com/lkevinzc/dance.
- Dissertation
- 10.17760/d20439211
- Aug 24, 2022
Instance segmentation algorithms are used everywhere, be it self-driving cars, scene mapping by autonomous robots or analyzing medical scans. Instance segmentation can be thought of as a further refinement of semantic segmentation. Object detection algorithms try to detect objects from the scene by enclosing them in bounding boxes, semantic segmentation tries to label these objects, whereas instance segmentation tries to label each unique instance of these objects. The task is quite complex and becomes even more challenging when the scope is microscopic data. Objects in microscopic data do not usually follow a fixed shape or orientation, therefore it becomes very difficult to identify unique instances of these objects using axis-aligned bounding boxes. The alternative approach that researchers take is to do pixel-wise prediction and then agglomerate those together to ultimately get the final object instances. In this thesis we present a novel loss function which we have used to train a U-Net that predicts n-dimensional embedding maps, or ARID (Affinity Representing Instance Descriptors). These embedding vectors contain dense information which can then be used to generate segmentation maps using post-processing approaches. Previous methods have attempted to learn affinities but are prone to errors, resulting in erroneous segmentation. We show that our segmentation pipeline using the ARID embedding map surpasses the performance of affinity-based networks and solves the problem of merge errors. Our segmentation pipeline has two phases. The first is predicting the ARID embedding, for which we trained a U-Net architecture using an ultrametric loss. Multiple configurations were tested and compared. The second phase is post-processing, which is further divided into two steps: segmentation generation and refinement. 
We present a very basic technique that generates a Euclidean minimum spanning tree and prunes the edges whose distance is larger than the provided threshold to generate a segmentation. The other part of the post-processing pipeline is segmentation refinement, for which we propose approaches to refine the generated segmentation. We evaluate performance using Average Precision (AP) at IoU thresholds ranging from 0.5 to 0.95 with an increment of 0.05. The best average AP at an IoU threshold of 0.5 that we obtained from the affinity-based networks is 0.63; we show that our segmentation pipeline generates segmentation maps with a best average AP of 0.826 at the same threshold, surpassing the affinity-based network performance. We also show the failure modes of our proposed loss function and present the future scope of research in the field. Embedding-based approaches show promise for efficient instance segmentation, especially in complex scenes such as microscopic data. The generalized loss function that we present in this thesis is capable of doing this task, and offers a better alternative to affinity-based methods for segmentation.--Author's abstract
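The minimum spanning tree step can be sketched in plain Python. The sketch assumes the embedding vectors are given as a short list of points and uses an O(n^2) version of Prim's algorithm; the names `mst_edges` and `segment_by_mst` are hypothetical and the thesis implementation is not shown in the abstract.

```python
import math


def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph; returns (dist, i, j)."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection cost to the tree
    parent = [-1] * n
    if n:
        best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def segment_by_mst(points, threshold):
    """Drop MST edges longer than `threshold`; label the remaining components."""
    label = list(range(len(points)))

    def find(i):
        while label[i] != i:
            label[i] = label[label[i]]  # path halving
            i = label[i]
        return i

    for d, i, j in mst_edges(points):
        if d <= threshold:
            label[find(i)] = find(j)
    return [find(i) for i in range(len(points))]


labels = segment_by_mst([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)], threshold=1.0)
```

Points that stay connected after pruning share a label, so each connected component of the pruned tree becomes one segment.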