Unification of Semantic and Instance Segmentation with BoundaryX
Abstract: Semantic segmentation is a field of image content recognition in which each pixel is classified according to the type of object it belongs to, while instance segmentation distinguishes individual object instances. A novel method, BoundaryX, is proposed to unify both tasks without relying on bounding boxes. Each pixel is classified, and boundaries are drawn around separate instances, enabling easy bounding box calculation without shape constraints or region proposals. Both instanced objects (like people) and non-instanced ones (like the sky) are handled by BoundaryX, without hardcoded exceptions. The quality of the method was evaluated on the COCO dataset for the class “people” by measuring Intersection over Union (IoU) for the semantic segmentation and recall and precision for the bounding boxes. The method achieved 0.774 IoU for semantic segmentation, and 75% recall and 83% precision for bounding box quality. Segmentation pipelines are simplified through the unified solution and flexible boundary-based representation provided by BoundaryX.
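The metrics named in this abstract can be illustrated with a minimal Python sketch. The function names and the toy masks are assumptions made for illustration; the abstract does not say how predicted boxes are matched to ground truth, so the true-positive, false-positive, and false-negative counts are simply taken as given.

```python
def mask_iou(pred, truth):
    """IoU of two binary masks given as equally sized lists of 0/1 rows."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks are treated as a perfect match (a convention, not from the source).
    return inter / union if union else 1.0


def precision_recall(true_pos, false_pos, false_neg):
    """Precision and recall from match counts; the matching rule is up to the caller."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Toy 3x3 masks: 3 pixels agree, 5 pixels are set in at least one mask.
pred_mask = [[0, 1, 1],
             [0, 1, 1],
             [0, 0, 0]]
true_mask = [[0, 1, 1],
             [0, 1, 0],
             [0, 1, 0]]

iou = mask_iou(pred_mask, true_mask)           # 3 / 5
precision, recall = precision_recall(3, 1, 1)  # 3 / 4 each
```

With these toy inputs the IoU is 3/5 and both precision and recall are 3/4; the numbers are unrelated to the results reported for BoundaryX.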
- Dissertation
- 10.17760/d20439211
- Aug 24, 2022
Instance segmentation algorithms are used widely, whether in self-driving cars, scene mapping by autonomous robots, or the analysis of medical scans. Instance segmentation can be thought of as a further refinement of semantic segmentation. Object detection algorithms try to detect objects in a scene by enclosing them in bounding boxes, semantic segmentation tries to label these objects, and instance segmentation tries to label each unique instance of these objects. The task is quite complex and becomes even more challenging when the scope is microscopic data. Objects in microscopic data do not usually follow a fixed shape or orientation, so it becomes very difficult to identify unique instances of these objects using axis-aligned bounding boxes. The alternative approach that researchers take is to make pixel-wise predictions and then agglomerate them to obtain the final object instances. In this thesis we present a novel loss function, which we use to train a U-Net that predicts n-dimensional embedding maps, or ARID (Affinity Representing Instance Descriptors). These embedding vectors contain dense information that can then be used to generate segmentation maps through post-processing. Previous methods have attempted to learn affinities but are prone to errors, resulting in erroneous segmentation. We show that our segmentation pipeline using the ARID embedding map surpasses the performance of affinity-based networks and solves the problem of merge errors. Our segmentation pipeline has two phases. The first is predicting the ARID embedding, for which we trained a U-Net architecture using an ultrametric loss; multiple configurations were tested and compared. The second phase is post-processing, which is further divided into two steps: segmentation generation and refinement.
We present a very basic technique that generates a Euclidean minimum spanning tree and prunes the edges whose length exceeds a provided threshold to produce a segmentation. The other part of the post-processing pipeline is segmentation refinement, for which we propose approaches to refine the generated segmentation. We evaluate performance using Average Precision (AP) at IoU thresholds ranging from 0.5 to 0.95 in increments of 0.05. The best average AP0.5 IoU score obtained from the affinity-based networks is 0.63; our segmentation pipeline generates segmentation maps with a best average AP0.5 IoU score of 0.826, surpassing the affinity-based networks. We also show the failure modes of our proposed loss function and present the future scope of research in the field. Embedding-based approaches show promise for efficient instance segmentation, especially in complex scenes such as those in microscopic data. The generalized loss function presented in this thesis is capable of this task and offers a better alternative to affinity-based methods for segmentation.--Author's abstract
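The segmentation-generation step described in this abstract (build a Euclidean minimum spanning tree over the embedding vectors, then drop edges longer than a threshold) can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names, the use of Prim's algorithm on the complete graph, and the toy 2-D embeddings are all assumptions.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph.

    Returns a list of (distance, i, j) edges; O(n^2), fine for a small sketch.
    """
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n   # shortest known distance from the tree to each point
    best_from = [-1] * n         # tree vertex that achieves that distance
    best_dist[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_from[u] >= 0:
            edges.append((best_dist[u], best_from[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return edges


def segment_by_pruning(points, threshold):
    """Label each point by its connected component after pruning long MST edges."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for dist, i, j in minimum_spanning_tree(points):
        if dist <= threshold:              # edges longer than the threshold are pruned
            parent[find(i)] = find(j)

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]


# Hypothetical 2-D embeddings for five pixels: two well-separated groups.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
labels = segment_by_pruning(embeddings, threshold=1.0)
```

In this toy case the only MST edge longer than the threshold is the one bridging the two groups, so pruning it leaves two segments. Raising the threshold above that edge's length would merge everything into one segment, which is the kind of merge error the threshold has to guard against.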
- Research Article
30
- 10.1109/access.2020.3003917
- Jan 1, 2020
- IEEE Access
Instance segmentation is typically based on an object detection framework. Semantic segmentation is conducted on the bounding boxes that are returned by detectors. NMS (non-maximum suppression) is a common post-processing operation in instance segmentation and object detection tasks. It is typically used after bounding box regression to eliminate redundant bounding boxes. The evaluation criteria for object detection require that the bounding box be as close as possible to the ground truth, but they do not emphasize the integrity of the included object. However, sometimes the bounding boxes cannot contain the complete objects, and the parts beyond the bounding boxes cannot be correctly predicted in the subsequent semantic segmentation. To solve this problem, we propose the Syncretic-NMS algorithm. The algorithm takes traditional NMS as the first step and processes the bounding boxes obtained by traditional NMS, judges the neighboring bounding boxes of each bounding box, and combines the neighboring boxes that are strongly correlated with the corresponding bounding boxes. The coordinates of the merged box are the four coordinate extremes of the bounding box and the highly relevant neighboring box. The neighboring box with strong correlation is merged with the corresponding bounding box. Based on an analysis of the influences of corresponding factors, the criteria for correlation judgment are specified. Experimental results on the MS COCO dataset demonstrate that Syncretic-NMS can steadily increase the accuracy of instance segmentation, while experimental results on the Cityscapes dataset prove that the algorithm can adapt to application scenario changes. The computational complexity of Syncretic-NMS is the same as that of traditional NMS. Syncretic-NMS is easy to implement, requires no additional training, and can be easily integrated into the available instance segmentation framework.
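The merge rule stated in this abstract (the merged box takes the four coordinate extremes of a box and its strongly correlated neighbor) can be sketched in a few lines of Python. The abstract does not give the correlation criteria, so the `is_correlated` predicate below is a caller-supplied placeholder, and `boxes_overlap` is only a hypothetical stand-in for it, not the criterion used by Syncretic-NMS.

```python
def merge_boxes(box, neighbor):
    """Smallest axis-aligned box covering both inputs; boxes are (x1, y1, x2, y2)."""
    return (min(box[0], neighbor[0]), min(box[1], neighbor[1]),
            max(box[2], neighbor[2]), max(box[3], neighbor[3]))


def merge_correlated(box, neighbors, is_correlated):
    """Merge into `box` every neighbor that the predicate judges strongly correlated."""
    merged = box
    for neighbor in neighbors:
        if is_correlated(box, neighbor):
            merged = merge_boxes(merged, neighbor)
    return merged


def boxes_overlap(a, b):
    """Hypothetical stand-in criterion: the two boxes share some area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


kept_box = (10, 10, 50, 50)                       # a box that survived ordinary NMS
neighbors = [(40, 20, 70, 60), (200, 200, 220, 220)]
result = merge_correlated(kept_box, neighbors, boxes_overlap)
```

Here only the first neighbor overlaps the kept box, so the result spans (10, 10) to (70, 60) and the distant box is ignored. Because the merge is a constant-time operation per pair, a step like this is consistent with the claim that the overall complexity matches that of traditional NMS.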
- Conference Article
7
- 10.1109/icra48506.2021.9560798
- May 30, 2021
Panoptic Segmentation aims to provide an understanding of background (stuff) and instances of objects (things) at a pixel level. It combines the separate tasks of semantic segmentation (pixel level classification) and instance segmentation to build a single unified scene understanding task. Typically, panoptic segmentation is derived by combining semantic and instance segmentation tasks that are learned separately or jointly (multi-task networks). In general, instance segmentation networks are built by adding a foreground mask estimation layer on top of object detectors or using instance clustering methods that assign a pixel to an instance center. In this work, we present a fully convolutional neural network that learns instance segmentation from semantic segmentation and instance contours (boundaries of things). Instance contours along with semantic segmentation yield a boundary-aware semantic segmentation of things. Connected component labeling on these results produces instance segmentation. We merge semantic and instance segmentation results to output panoptic segmentation. We evaluate our proposed method on the Cityscapes dataset to demonstrate qualitative and quantitative performance along with several ablation studies. Our overview video can be accessed from https://youtu.be/wBtcxRhG3e0.
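The step from boundary-aware semantic segmentation to instances can be illustrated with a small connected component labeling routine. This is a sketch under assumptions: the function name, the 4-connectivity choice, and the toy masks are illustrative, and in the paper both masks come from a network rather than being hand-written.

```python
from collections import deque


def instances_from_contours(thing_mask, contour_mask):
    """Label 4-connected regions of 'thing' pixels that do not lie on a contour.

    Both inputs are non-empty lists of 0/1 rows with the same shape.
    Returns a label map: 0 for background or contour, 1..K for instances.
    """
    height, width = len(thing_mask), len(thing_mask[0])
    labels = [[0] * width for _ in range(height)]
    next_label = 0
    for r in range(height):
        for c in range(width):
            if not thing_mask[r][c] or contour_mask[r][c] or labels[r][c]:
                continue
            next_label += 1
            labels[r][c] = next_label
            queue = deque([(r, c)])
            while queue:                      # breadth-first flood fill
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and thing_mask[ny][nx]
                            and not contour_mask[ny][nx]
                            and not labels[ny][nx]):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
    return labels


# One semantic blob of 'thing' pixels, split by a vertical contour in column 2.
thing = [[1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1]]
contour = [[0, 0, 1, 0, 0],
           [0, 0, 1, 0, 0]]
instance_map = instances_from_contours(thing, contour)
```

The semantic mask alone would give a single region; removing the contour pixels before labeling separates it into two instances. Contour pixels stay unlabeled in this sketch, so a real pipeline would still need a rule for assigning them to a neighboring instance.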
- Research Article
8
- 10.1016/j.geomorph.2024.109212
- Apr 22, 2024
- Geomorphology
Detection of karst depression in Brazil comparing different semantic and instance segmentations and global digital elevation models
- Research Article
21
- 10.3390/rs13142788
- Jul 15, 2021
- Remote Sensing
Instance segmentation of high-resolution aerial images is challenging when compared to object detection and semantic segmentation in remote sensing applications. It adopts boundary-aware mask predictions, instead of traditional bounding boxes, to locate the objects of interest pixel-wise. Meanwhile, instance segmentation can distinguish densely distributed objects within a category by assigning them different colors, which is not possible in semantic segmentation. Despite these distinct advantages, few methods are dedicated to high-quality instance segmentation of high-resolution aerial images. In this paper, a novel instance segmentation method for high-resolution aerial images, termed consistent proposals of instance segmentation network (CPISNet), is proposed. Following the top-down instance segmentation formulation, it adopts the adaptive feature extraction network (AFEN) to extract the multi-level bottom-up augmented feature maps at the design space level. Then, the elaborated RoI extractor (ERoIE) is designed to extract the mask RoIs via the refined bounding boxes from the proposal consistent cascaded (PCC) architecture and the multi-level features from AFEN. Finally, a convolution block with a shortcut connection is responsible for generating the binary mask for instance segmentation. The following experimental conclusions can be drawn on the iSAID and NWPU VHR-10 instance segmentation datasets: (1) Each individual module in CPISNet contributes to overall instance segmentation performance; (2) CPISNet* exceeds vanilla Mask R-CNN by 3.4%/3.8% AP on the iSAID validation/test sets and by 9.2% AP on the NWPU VHR-10 instance segmentation dataset; (3) Aliasing masks, missing segmentations, false alarms, and poorly segmented masks can be avoided to some extent by CPISNet; (4) CPISNet achieves high instance segmentation precision on aerial images and delineates objects with well-fitting boundaries.
- Research Article
9
- 10.1016/j.imavis.2021.104129
- Mar 3, 2021
- Image and Vision Computing
HCFS3D: Hierarchical coupled feature selection network for 3D semantic and instance segmentation
- Research Article
29
- 10.3390/rs14030531
- Jan 23, 2022
- Remote Sensing
Instance segmentation in remote sensing images is challenging because it requires both object-level discrimination and pixel-level segmentation of the objects. In remote sensing applications, instance segmentation adopts the instance-aware mask, rather than the horizontal or oriented bounding box of object detection or the category-aware mask of semantic segmentation, to interpret objects along their boundaries. Despite these distinct advantages, versatile instance segmentation methods are still to be discovered for remote sensing images. In this paper, an efficient instance segmentation paradigm (EISP) for interpreting synthetic aperture radar (SAR) and optical images is proposed. EISP mainly consists of the Swin Transformer to construct the hierarchical features of SAR and optical images, the context information flow (CIF) for interweaving the semantic features from the bounding box branch to the mask branch, and the confluent loss function for refining the predicted masks. The following experimental conclusions can be drawn on the PSeg-SSDD (Polygon Segmentation—SAR Ship Detection Dataset) and NWPU VHR-10 instance segmentation dataset (optical dataset): (1) Swin-L, CIF, and the confluent loss function in EISP each contribute to overall instance segmentation performance; (2) EISP* exceeds vanilla Mask R-CNN by 4.2% AP on PSeg-SSDD and by 11.2% AP on the NWPU VHR-10 instance segmentation dataset; (3) Poorly segmented masks, false alarms, missing segmentations, and aliasing masks can be avoided to a great extent by EISP* when segmenting SAR and optical images; (4) EISP* achieves the highest instance segmentation AP value compared to the state-of-the-art instance segmentation methods.
- Conference Article
207
- 10.1109/cvpr46437.2021.00267
- Jun 1, 2021
Weakly supervised segmentation methods using bounding box annotations focus on obtaining a pixel-level mask from each box containing an object. Existing methods typically depend on a class-agnostic mask generator, which operates on the low-level information intrinsic to an image. In this work, we utilize higher-level information from the behavior of a trained object detector, by seeking the smallest areas of the image from which the object detector produces almost the same result as it does from the whole image. These areas constitute a bounding-box attribution map (BBAM), which identifies the target object in its bounding box and thus serves as pseudo ground-truth for weakly supervised semantic and instance segmentation. This approach significantly outperforms recent comparable techniques on both the PASCAL VOC and MS COCO benchmarks in weakly supervised semantic and instance segmentation. In addition, we provide a detailed analysis of our method, offering deeper insight into the behavior of the BBAM.
- Research Article
- 10.1155/acis/1918054
- Jan 1, 2025
- Applied Computational Intelligence and Soft Computing
Deep learning–based segmentation models have gained significant focus in various computer vision applications, including remote sensing and medical imaging. Existing deep learning architectures handle semantic and instance segmentation separately and suffer from limitations such as imprecise boundary delineation, poor spatial consistency, improper fine‐grained object separation, and inaccurate instance segmentation, particularly when handling intricate object structures in remote sensing images (RSI). To mitigate these issues, we propose a unified deep framework that integrates both semantic and instance segmentation within a single architecture tailored for high‐resolution RSI. Our framework combines an improved attention residual U‐Net (IARU‐Net) for pixel‐level semantic segmentation and a dynamic Mask R‐CNN for instance‐level segmentation. To improve spatial consistency and edge sharpness, we apply conditional random fields (CRFs) as a postprocessing step on the output segmentation map of the enhanced U‐Net. This refined semantic mask serves as input to the dynamic Mask R‐CNN model for instance segmentation, where the graph‐based refinement module (GRM) is employed to improve boundary accuracy by leveraging graph‐based smoothing techniques. Our approach improves object delineation, increases segmentation accuracy, and decreases false positives compared to conventional deep learning architectures. Evaluation outcomes on standard datasets illustrate that the proposed approach attains superior performance in both semantic and instance segmentation tasks. The results validate the effectiveness of jointly modeling semantic and instance‐level information, providing a more comprehensive understanding of complex remote sensing scenes.
- Conference Article
4
- 10.1109/icpr48806.2021.9412635
- Jan 10, 2021
Instance segmentation and panoptic segmentation have received increasing attention in recent years. In comparison with bounding-box-based object detection and semantic segmentation, instance segmentation can provide more analytical results at pixel level. Given the insight that pixels belonging to one instance share one or more common attributes of that instance, we propose a one-stage instance segmentation network named Common Attribute Support Network (CASNet), which realizes instance segmentation by predicting and clustering common attributes. CASNet is fully convolutional and can be trained and run end to end. CASNet also predicts instances without overlaps or holes, a problem present in most current instance segmentation algorithms. Furthermore, it can be easily extended to panoptic segmentation through minor modifications with little computation overhead. CASNet builds a bridge between semantic and instance segmentation, moving from finding a pixel's class ID to obtaining its class and instance ID through operations on common attributes. In experiments on instance and panoptic segmentation, CASNet achieves mAP 32.8% and PQ 59.0% on the Cityscapes validation dataset with joint training, and mAP 36.3% and PQ 66.1% with separate training. For panoptic segmentation, CASNet achieves state-of-the-art performance on the Cityscapes validation dataset.
- Research Article
656
- 10.1609/aaai.v32i1.12269
- Apr 27, 2018
- Proceedings of the AAAI Conference on Artificial Intelligence
Most state-of-the-art scene text detection algorithms are deep learning based methods that depend on bounding box regression and perform at least two kinds of predictions: text/non-text classification and location regression. Regression plays a key role in the acquisition of bounding boxes in these methods, but it is not indispensable because text/non-text prediction can also be considered as a kind of semantic segmentation that contains full location information in itself. However, text instances in scene images often lie very close to each other, making them very difficult to separate via semantic segmentation. Therefore, instance segmentation is needed to address this problem. In this paper, PixelLink, a novel scene text detection algorithm based on instance segmentation, is proposed. Text instances are first segmented out by linking pixels within the same instance together. Text bounding boxes are then extracted directly from the segmentation result without location regression. Experiments show that, compared with regression-based methods, PixelLink can achieve better or comparable performance on several benchmarks, while requiring many fewer training iterations and less training data.
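The idea of linking pixels into instances and reading boxes directly off the result can be sketched with a union-find structure. This is a simplified illustration, not the PixelLink implementation: the function name is hypothetical, only right and down links are modeled, and the masks are hand-written toy values rather than network predictions.

```python
def boxes_from_links(text_mask, right_link, down_link):
    """Group text pixels joined by positive links and return one box per group.

    text_mask[r][c]  : 1 if the pixel is classified as text
    right_link[r][c] : 1 if pixel (r, c) is linked to (r, c + 1)
    down_link[r][c]  : 1 if pixel (r, c) is linked to (r + 1, c)
    Boxes are (x1, y1, x2, y2) in inclusive pixel coordinates.
    """
    height, width = len(text_mask), len(text_mask[0])
    parent = list(range(height * width))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for r in range(height):
        for c in range(width):
            if not text_mask[r][c]:
                continue
            here = r * width + c
            if c + 1 < width and text_mask[r][c + 1] and right_link[r][c]:
                parent[find(here)] = find(here + 1)
            if r + 1 < height and text_mask[r + 1][c] and down_link[r][c]:
                parent[find(here)] = find(here + width)

    boxes = {}
    for r in range(height):
        for c in range(width):
            if text_mask[r][c]:
                root = find(r * width + c)
                x1, y1, x2, y2 = boxes.get(root, (c, r, c, r))
                boxes[root] = (min(x1, c), min(y1, r), max(x2, c), max(y2, r))
    return sorted(boxes.values())


# Two text instances touching each other: the text mask is a single blob,
# but no link crosses between columns 1 and 2.
text_mask = [[1, 1, 1, 1],
             [1, 1, 1, 1]]
right_link = [[1, 0, 1, 0],
              [1, 0, 1, 0]]
down_link = [[1, 1, 1, 1],
             [0, 0, 0, 0]]
text_boxes = boxes_from_links(text_mask, right_link, down_link)
```

Semantic segmentation alone would treat the whole mask as one region; the missing links between columns 1 and 2 are what separate the two instances, and the boxes follow from the grouped pixels without any location regression.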
- Research Article
6
- 10.1016/j.patcog.2021.108240
- Aug 18, 2021
- Pattern Recognition
Learning panoptic segmentation through feature discriminability
- Research Article
121
- 10.1016/j.engappai.2019.103271
- Oct 8, 2019
- Engineering Applications of Artificial Intelligence
Semantic versus instance segmentation in microscopic algae detection
- Conference Article
- 10.21528/cbic2023-103
- Dec 31, 2023
The use of Artificial Intelligence (AI) as an assistant for diagnosis in imaging exams has already proven to be effective, and is known as Computer-Aided Diagnosis (CAD). This paper evaluates the effectiveness of using a single network, YOLOv8x, the current state of the art in the YOLO family, for lumbar spine detection and segmentation in Magnetic Resonance Imaging (MRI) exams. The network was used for detection, classification, and semantic segmentation, generating the masks over the vertebrae, which simplified the implementation and reduced the computational cost. Encouraging results were obtained using a dataset of 1,116 samples (images). The detection step achieved a mean average precision (mAP) of 0.989 at 50% intersection over union (IoU), mAP:50-95 of 0.886, recall of 0.98, and precision of 0.97. For bounding box marking, the following results were achieved: mAP of 0.978 at 50% IoU, mAP:50-95 of 0.882, recall of 0.971, and precision of 0.948. The semantic segmentation step achieved a mAP of 0.978 at 50% IoU, mAP:50-95 of 0.856, recall of 0.971, and precision of 0.948. These results demonstrate the effectiveness of using YOLOv8x for lumbar spine detection and segmentation in MRI exams.
- Conference Article
6
- 10.1109/cac48633.2019.8997311
- Nov 1, 2019
Intersection over Union (IoU) is the most important metric in visual tracking benchmarks. However, IoU cannot always accurately describe the similarity between two bounding boxes. In some cases, IoU does not correctly reflect the similarity of location, shape (aspect ratio) and area between two bounding boxes: even if two pairs of bounding boxes have the same IoU, their deviations in position, shape and area may differ. In this paper, we propose a new evaluation metric, called Advanced Intersection over Union (AIoU), to solve this problem by adding penalties for changes in position, shape and area between two bounding boxes, and apply AIoU as a loss function to the bounding box regression part of a Siamese tracker. By training the regression branch of the Siamese tracker with AIoU loss, IoU loss and the traditional minimum Mean Square Error (MSE) loss function, we show that the new AIoU loss is more effective for localization than MSE loss and IoU loss on the VOT benchmark. We also combine SiamRPN with AIoU loss to form the SiamAIoU tracker and compare our method with many state-of-the-art trackers; the results show that SiamAIoU achieves higher scores on OTB100, VOT2016 and VOT2018. In addition, our tracker runs at 35 FPS in real time.
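The limitation described in this abstract can be shown with plain box IoU: two quite different predictions can score the same against one target. The sketch below only demonstrates that limitation; it does not implement AIoU, whose penalty terms are not given here, and the box values are made up for illustration.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


target = (0, 0, 6, 6)

# Same size and shape as the target, but shifted right by half its width.
shifted = (3, 0, 9, 6)

# Anchored at the target's corner, but one third of its width.
narrow = (0, 0, 2, 6)

iou_shifted = box_iou(target, shifted)  # 18 / 54
iou_narrow = box_iou(target, narrow)    # 12 / 36
```

Both predictions have an IoU of one third with the target, yet one is a position error and the other is a shape and area error. A loss built on IoU alone cannot tell them apart, which is the gap the additional penalties in AIoU are meant to address.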