Articles published on Object Segmentation
4285 Search results
- Research Article
- 10.1016/j.media.2026.103971
- May 1, 2026
- Medical image analysis
- Chengjin Yu + 7 more
Diversity-driven MG-MAE: Multi-granularity representation learning for non-salient object segmentation.
- Research Article
- 10.1145/3787522
- Apr 20, 2026
- ACM Transactions on Multimedia Computing, Communications, and Applications
- Saikat Dutta + 3 more
Remote sensing image segmentation poses significant challenges in generalizing to unseen categories during the evaluation phase. Existing open-vocabulary segmentation methods, primarily designed for natural images, struggle to cope with the spatial complexity, scale variation, and high-resolution characteristics of remote sensing imagery. Specifically, scale variations during inference can degrade performance, as the model tends to overfit to fixed-scale patterns encountered during training. This also affects the model's ability to recognize unseen or novel class objects appearing in varying sizes or resolutions during testing. These limitations increase the need for open-vocabulary segmentation methods that address the challenges of geospatial images. In this work, we introduce AerOSeg++, an open-vocabulary segmentation method for remote sensing that focuses on scale-invariant feature learning. We first compute robust image-text correlation features using rotated input images and domain-specific prompts. These are refined via spatial and class refinement blocks, guided by SAM features to enhance spatial consistency. To upscale the refined correlation features, we propose a multi-scale decoder framework that fuses fine-grained texture features with SAM-derived features. By leveraging texture information across multiple receptive fields, AerOSeg++ effectively captures scale-consistent patterns, facilitating accurate segmentation of objects across varying spatial resolutions. Additionally, our training pipeline incorporates ScaleDrop, a computationally efficient, parameter-free feature rescaling module that ensures scale-invariant feature representation learning. Our proposed model shows significant performance gains over state-of-the-art open-vocabulary methods when evaluated on three benchmark datasets for remote sensing: iSAID, DLRSD, and OpenEarthMap.
These results highlight the effectiveness of our scale-invariant design and texture-guided multi-scale feature upsampling in handling the challenges of open-vocabulary segmentation in remote sensing imagery.
- Research Article
- 10.1109/tpami.2026.3684742
- Apr 16, 2026
- IEEE transactions on pattern analysis and machine intelligence
- Shengye Qiao + 4 more
Recent progress in semi-supervised video object segmentation has largely hinged on memory-based methods. However, when faced with increasingly tough challenges emerging in complex scenarios, such as fundamental semantic transformations and severe spatial deformations, the fixed-interval memory update mechanism usually adopted in these memory-based methods is insufficient to align with the pivotal moments of object changes. This inflexible mechanism motivates us to design an adaptive memory update mechanism that responds to the semantic-spatial changes of target objects. To this end, we propose a novel Change-Sensitive Network (CSNet) to learn when and how to update memory to effectively address intricate challenges in complex scenarios. Specifically, we first design an Adaptive Perception-Capture module with a hierarchical contrastive learning loss to determine when to update memory by measuring the extent of object changes, thus dividing entire videos into different object-change clips. To further extract and highlight object changes to assist in the segmentation of frames after changes occur, we construct Dynamic Memory Update modules to redefine how to update memory by smoothly retaining the object prototypes within clips and dynamically amplifying the object variations across clips. Extensive experiments demonstrate that our proposed CSNet exhibits clear superiority when evaluated on eight datasets covering three kinds: common, complex and long-video datasets.
- Research Article
- 10.1109/tpami.2025.3648837
- Apr 1, 2026
- IEEE transactions on pattern analysis and machine intelligence
- Miao Wang + 3 more
Open-vocabulary querying in 3D space is challenging but essential for scene understanding tasks such as object localization and segmentation. Language embedded scene representations have made progress by incorporating language features into 3D spaces. However, their efficacy heavily depends on neural networks that are resource-intensive in training and rendering. Although recent 3D Gaussians offer efficient and high-quality novel view synthesis, directly embedding language features in them leads to prohibitive memory usage and decreased performance. In this work, we introduce Language Embedded 3D Gaussians, a novel scene representation for open-vocabulary query tasks. Instead of embedding high-dimensional raw semantic features on 3D Gaussians, we propose a dedicated quantization scheme that drastically alleviates the memory requirement, and a novel embedding procedure that achieves smoother yet high accuracy query, countering the multi-view feature inconsistencies and the high-frequency inductive bias in point-based representations. Our comprehensive experiments show that our representation achieves the best visual quality and language querying accuracy across current language embedded representations, while maintaining real-time rendering frame rates on a single desktop GPU.
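The memory argument in the abstract above can be illustrated with generic vector quantization: each Gaussian stores a small integer index into a shared codebook instead of a full high-dimensional feature vector. The sketch below is only a nearest-codeword lookup under that assumption; it is not the paper's dedicated quantization scheme, and the function names and toy codebook are hypothetical.

```python
def nearest_code(feature, codebook):
    """Index of the codebook entry closest to `feature` (squared Euclidean distance)."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(codebook)), key=lambda i: sqdist(feature, codebook[i]))


def quantize(features, codebook):
    """Replace every feature vector by the index of its nearest codeword."""
    return [nearest_code(f, codebook) for f in features]


def dequantize(indices, codebook):
    """Recover approximate feature vectors from stored indices."""
    return [codebook[i] for i in indices]


# Toy data: two codewords, two per-Gaussian features (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0]]
features = [[0.1, -0.1], [0.9, 1.2]]
indices = quantize(features, codebook)
```

With a codebook of K entries, per-Gaussian storage drops from one float per feature dimension to a single index in the range 0 to K-1, at the cost of approximation error.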
- Research Article
- 10.1016/j.dib.2026.112477
- Apr 1, 2026
- Data in brief
- Michele Elia + 8 more
Towards sustainable management of Xylella fastidiosa vectors: An annotated image dataset for automated in-field detection of Aphrophoridae foam.
- Research Article
- 10.1088/2057-1976/ae5970
- Apr 1, 2026
- Biomedical Physics & Engineering Express
- Rahmat Riyadi + 5 more
The purpose of this study is to develop software for automatic low-contrast object segmentation in the ACR 464 computed tomography (CT) phantom and to evaluate the quantitative effect of radiation dose and object size on low-contrast detectability (LCD). The software was developed using MATLAB R2013a. The anchor coordinate of the largest low-contrast object (25 mm) was determined statistically by rotating a region of interest (ROI) of identical size over 360° in 1° angular increments to identify the coordinate corresponding to the maximum CT number. Meanwhile, the center of the phantom was determined using a threshold-based method. The two coordinates were used as references for detecting the other low-contrast objects using template matching. Regions of interest (ROIs) were automatically located within the low-contrast objects and in the background at the center of the phantom image. Mean CT number, noise, contrast, and contrast-to-noise ratio (CNR) were subsequently computed. The low-contrast object detectability threshold was defined as a CNR cut-off of 1. The robustness of the anchor coordinate determination algorithm was evaluated across a range of imaging conditions, specifically targeting scenarios involving extreme noise levels and image tilting. The algorithm was tested on images scanned with various volume CT dose indexes (CTDIvol) of 21.4, 26.8, 32.1, 37.5, 42.8, and 53.6 mGy. The results were compared with a manual method (using micro DICOM viewer software), and a paired-sample t-test between the automatic and manual results was carried out. The results obtained using the automated method indicate that the minimum resolved object sizes were 5, 5, 4, 4, 4, and 4 mm at CTDIvol values of 21.6, 26.8, 32.1, 37.5, 42.8, and 53.6 mGy, respectively. 
In comparison, the manual method yielded minimum resolved object sizes of 5, 5, 5, 4, 4, and 4 mm across the same CTDIvol levels, demonstrating a slight improvement in resolution for the automated approach at the 32.1 mGy dose level. An increase in CTDIvol led to an increase in CNR. In conclusion, an automatic method for detecting low-contrast objects in the ACR 464 CT phantom was successfully developed. Low-contrast object segmentation was shown to be accurate in test images.
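The detectability criterion in the abstract above (a CNR cut-off of 1) can be sketched as follows. The abstract does not give the exact formula, so this assumes the common definition CNR = |mean(object ROI) - mean(background ROI)| / standard deviation of the background ROI; the function names and sample CT numbers are hypothetical, and the original software was written in MATLAB rather than Python.

```python
import statistics


def cnr(object_roi, background_roi):
    """Contrast-to-noise ratio between two lists of CT numbers (HU).

    Assumed definition: absolute difference of ROI means divided by the
    population standard deviation of the background ROI.
    """
    contrast = abs(statistics.mean(object_roi) - statistics.mean(background_roi))
    noise = statistics.pstdev(background_roi)
    return contrast / noise


def is_detectable(object_roi, background_roi, cutoff=1.0):
    """Apply the CNR cut-off used as the detectability threshold."""
    return cnr(object_roi, background_roi) >= cutoff


# Hypothetical ROI samples (HU).
background = [88.0, 92.0, 90.0, 94.0, 86.0]
clear_object = [95.0, 97.0, 96.0, 98.0, 94.0]
faint_object = [91.0, 91.0, 91.0, 91.0, 91.0]
```

Here `clear_object` differs from the background mean by 6 HU against a noise of about 2.83 HU, so it passes the cut-off, while `faint_object` differs by only 1 HU and does not.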
- Research Article
- 10.1109/tmi.2025.3627954
- Apr 1, 2026
- IEEE transactions on medical imaging
- Yuwen Chen + 7 more
Manual annotation of volumetric medical images, such as magnetic resonance imaging (MRI) and computed tomography (CT), is a labor-intensive and time-consuming process. Recent advancements in foundation models for video object segmentation, such as Segment Anything Model 2 (SAM 2), offer a potential opportunity to significantly speed up the annotation process by manually annotating one or a few slices and then propagating target masks across the entire volume. However, the performance of SAM 2 in this context varies. Our experiments show that relying on a single memory bank and attention module is prone to error propagation, particularly at boundary regions where the target is present in the previous slice but absent in the current one. To address this problem, we propose Short-Long Memory SAM 2 (SLM-SAM 2), a novel architecture that integrates distinct short-term and long-term memory banks with separate attention modules to improve segmentation accuracy. We evaluate SLM-SAM 2 on four public datasets covering organs, bones, and muscles across MRI, CT, and ultrasound videos. We show that the proposed method markedly outperforms the default SAM 2, achieving an average Dice Similarity Coefficient improvement of 0.14 and 0.10 in the scenarios when 5 volumes and 1 volume are available for the initial adaptation, respectively. SLM-SAM 2 also exhibits stronger resistance to over-propagation, reducing the time required to correct propagated masks by 60.575% per volume compared to SAM 2, making a notable step toward more accurate automated annotation of medical images for segmentation model development.
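For readers unfamiliar with the metric reported above, the Dice Similarity Coefficient between a predicted and a reference binary mask is 2|A ∩ B| / (|A| + |B|). The sketch below is a minimal version over flattened 0/1 masks; returning 1.0 when both masks are empty is a convention chosen here, not something the abstract specifies.

```python
def dice(pred, target):
    """Dice Similarity Coefficient for two flat sequences of 0/1 labels."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return 2.0 * intersection / total


# One slice flattened to four pixels (hypothetical values).
predicted = [1, 1, 0, 0]
reference = [1, 0, 0, 0]
score = dice(predicted, reference)  # 2 * 1 / (2 + 1)
```

On this scale, the reported average improvements of 0.14 and 0.10 are absolute differences between two scores that each lie in [0, 1].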
- Research Article
- 10.1109/tmech.2025.3604612
- Apr 1, 2026
- IEEE/ASME Transactions on Mechatronics
- Jie Huang + 3 more
Edge-Aware Transformer for Adhesion Object Segmentation in XRT-Based Ore Presorting
- Research Article
- 10.1109/tcsvt.2025.3626574
- Apr 1, 2026
- IEEE Transactions on Circuits and Systems for Video Technology
- Jin Zhang + 5 more
Complex scene segmentation aims to segment objects with intricate details or those concealed within the background. Despite significant advancements, a persistent challenge remains: accurately identifying object edges in backgrounds with high inherent similarity and complex structures. To address this, we identify the prevalent spectral bias in image segmentation, where networks preferentially learn low-frequency information, as a key impediment to recognizing and learning object edges, which are rich in high-frequency details. To mitigate this bias, we propose MCNet, a segmentation framework designed to promote balanced frequency learning. MCNet comprises two primary components: multi-frequency perception (MP), which independently captures high-frequency details and low-frequency structural components of objects, and complementary fusion (CF), which intelligently fuses these distinct frequency features through learnable, adaptive mechanisms. Crucially, MCNet employs a novel frequency-aware consistency adversarial loss to explicitly guide the learning across different frequency bands. MCNet effectively integrates MP and CF, enhancing the detection of high-frequency details and low-frequency structures, thereby alleviating challenges posed by spectral bias. We evaluate the proposed method on complex scene segmentation tasks, including camouflaged object detection and dichotomous image segmentation. Through extensive comparisons with 31 existing methods across 8 benchmark datasets, we demonstrate the superiority of the proposed method.
- Research Article
- 10.3390/s26072170
- Mar 31, 2026
- Sensors (Basel, Switzerland)
- Biao Wang + 3 more
Transformer models have achieved powerful performance in various computer vision tasks. However, their black-box nature severely limits model interpretability and the reliability of real-world applications. Most existing interpretation methods generate explanation maps by perturbing masks from the last layer of the Transformer encoder, but they often overlook uncertain information in masks and detail loss during upsampling and downsampling, resulting in coarse localization, blurred boundaries, and significant background noise in explanations. To address these issues, this paper proposes a self-distillation object segmentation method based on sequential three-way mask and attention fusion (SAF-SD), targeting salient and camouflaged binary object segmentation tasks (sub-tasks of binary pixel-level segmentation). The method consists of two core modules: the sequential three-way mask (S3WM) module and the attention fusion (AF) module. The S3WM module performs strict threshold filtering on masks generated from the final-layer feature maps of the Transformer, aiming to accurately segment foreground objects from backgrounds via binary pixel-level prediction. The AF module aggregates attention matrices across all Transformer encoder layers to construct a cross-layer relation matrix, capturing global semantic dependencies among image patches (e.g., interactions between foreground, background, and edge regions). It then computes the importance score for each patch, refining details and suppressing noise in the initial explanation results. Extensive experimental results demonstrate that SAF-SD significantly outperforms existing baseline methods across key evaluation metrics.
- Research Article
- 10.1371/journal.pone.0345762
- Mar 31, 2026
- PLOS One
- Xin Wang + 7 more
Deep learning has recently made remarkable progress in remote sensing image segmentation, with hybrid architectures that integrate convolutional neural networks (CNNs) and Transformers emerging as a promising solution, particularly for high-resolution imagery. However, challenges remain in complex remote sensing scenes, particularly in capturing detailed boundary structures and small-scale targets. One key limitation lies in the suboptimal cross-level feature fusion within the encoder, resulting in semantic misalignment that hinders the precise segmentation of small objects and fine structural details. Additionally, during the decoding stage, the lack of explicit boundary guidance frequently causes the loss of edge information during feature reconstruction, compromising the delineation of object contours in intricate environments. To address these issues, we propose a novel hybrid architecture named Boundary-Guided Semantic Compensation Network (BGSC-Net). Our framework integrates two key components: a Cross-Level Semantic Compensation Module (CLSCM) that dynamically fuses high-level semantics with low-level spatial details to enhance small object segmentation, and an Auxiliary Boundary Supervision Module (ABSM) that enhances structural modeling for blurry or complex boundaries through explicit boundary modeling and an auxiliary supervision strategy based on joint optimization of the edge and main segmentation branches. Experiments show that BGSC-Net achieves superior segmentation performance, with mIoU scores of 87.57% on Potsdam, 85.61% on Vaihingen, 55.05% on LoveDA, and 74.77% on UAVid. To further validate its generalization capability in specialized fine-grained segmentation tasks, we evaluated the model on our challenging self-constructed Mangrove Species Fine-grained Segmentation Dataset (MSFSD), where it achieved an mIoU of 89.58%, confirming its practical utility for precise mangrove species mapping.
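The mIoU figures quoted above average the per-class intersection over union. The sketch below computes it for a single pair of flattened label maps and skips classes absent from both prediction and reference; benchmark protocols typically accumulate counts over a whole dataset and may treat ignored classes differently, so this is an illustration of the metric rather than the evaluation code behind the reported numbers.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection over union for two flat sequences of class ids."""
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(intersection / union)
    return sum(ious) / len(ious)


# Four pixels, two classes (hypothetical values).
predicted = [0, 0, 1, 1]
reference = [0, 1, 1, 1]
score = mean_iou(predicted, reference, num_classes=2)  # (1/2 + 2/3) / 2
```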
- Research Article
- 10.1002/advs.202517738
- Mar 31, 2026
- Advanced science (Weinheim, Baden-Wurttemberg, Germany)
- Christopher J Buswinka + 3 more
Segmenting individual instances of mitochondria from imaging datasets can provide rich quantitative information, but manual segmentation is prohibitively time-consuming, prompting the development of automated algorithms based on deep neural networks. Existing solutions for various segmentation tasks are optimized for either: high-resolution three-dimensional imaging, relying on well-defined object boundaries (e.g., whole neuron segmentation in volumetric electron microscopy datasets); or low-resolution two-dimensional imaging, boundary-invariant but poorly suited to large 3D objects (e.g., whole-cell segmentation of light microscopy images). However, there is a middle ground that challenges current segmentation tools: large 3D objects with ambiguous boundaries, such as mitochondria in whole-cell 3D electron microscopy datasets. To address this, we developed Skeleton-Oriented Object Segmentation (SKOOTS), a novel, general-purpose 3D segmentation framework for efficiently segmenting densely packed, morphologically complex objects. SKOOTS is fast, accurate, and memory-efficient, and can be applied to segment mitochondria and other structures in both 3D light and electron microscopy datasets. By combining skeleton-based instance segmentation with a scalable embedding approach, SKOOTS bridges a key gap in existing segmentation strategies and enables biologically meaningful, large-scale analysis of 3D biomedical imaging data. We demonstrate this by segmenting >15000 mitochondria from cochlear hair cells and supporting cells across experimental conditions in under 2 h on a consumer-grade PC, enabling downstream morphological analysis that revealed subtle structural changes following aminoglycoside exposure. SKOOTS is fully open-source, easy to retrain, and designed to support diverse datasets, making it broadly accessible to the research community.
- Research Article
- 10.3390/s26072029
- Mar 24, 2026
- Sensors (Basel, Switzerland)
- Linghao Dai + 4 more
Safety monitoring in container hoisting operations within rail-road intermodal logistics parks is a critical task in industrial safety management. Such scenarios are characterized by complex environments, large variations in target scales, deformable object shapes, and frequent occlusions, which pose significant challenges to visual perception systems. Conventional single-task models suffer from inherent limitations in handling low recall rates for distant small targets and insufficient adaptability to geometric deformations, making them inadequate for high-precision, real-time safety warning applications. To address these challenges, this study proposes a unified visual analysis framework that integrates semantic segmentation and object detection to enhance the recognition performance of small and deformable targets in complex operational environments, enabling real-time perception and safety warning of key objects and hazardous regions within container yards. Specifically, we introduce FSD-YOLO, a fusion-based architecture composed of the following key components. First, a SegFormer-based semantic segmentation module is employed to achieve pixel-level delineation of different operational regions. Second, an improved object detection network is developed based on the YOLOv8n architecture, incorporating: (1) the integration of C2f modules in the shallow layers of the backbone to enhance high-resolution feature extraction; (2) the embedding of C2fDCN modules within the detection head to improve modeling capability for deformable objects via deformable convolution; (3) the adoption of CARAFE upsampling operators to optimize multi-scale feature fusion; and (4) a dynamic loss-weighting strategy for small objects, where loss weights are adaptively adjusted according to target area to increase training emphasis on small-scale targets. 
Finally, a decision-level fusion strategy is applied to combine segmentation and detection outputs, enabling real-time safety judgment based on semantic rules. Experimental results on a self-constructed container yard dataset demonstrate that the proposed detection model achieves an mAP50-95 of 0.6433 and an mAP50 of 0.9565, significantly outperforming the baseline YOLOv8n model (mAP50-95: 0.5394, mAP50: 0.8435), thereby validating the effectiveness of the proposed framework.
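The decision-level fusion step described above can be pictured as a rule evaluated on the two model outputs. The abstract does not state the actual semantic rules, so the sketch below assumes one plausible rule: raise a warning when the bottom-centre point of a detected person's box falls on a pixel that the segmentation map labels as hazardous. The label id, class names, and toy data are all hypothetical.

```python
HAZARD = 1  # hypothetical label id for a hazardous region in the segmentation map


def footpoint(box):
    """Bottom-centre pixel (col, row) of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return (x1 + x2) // 2, y2


def warnings(seg_map, detections, watched=("person",)):
    """Return detections of watched classes whose footpoint lies on a hazard pixel."""
    height, width = len(seg_map), len(seg_map[0])
    alerts = []
    for label, box in detections:
        col, row = footpoint(box)
        col = min(max(col, 0), width - 1)   # clamp to the image bounds
        row = min(max(row, 0), height - 1)
        if label in watched and seg_map[row][col] == HAZARD:
            alerts.append((label, box))
    return alerts


# Toy 3x4 segmentation map and three detections (hypothetical values).
seg_map = [
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
detections = [
    ("person", (2, 0, 3, 2)),  # footpoint (2, 2): inside the hazard region
    ("person", (0, 0, 1, 1)),  # footpoint (0, 1): outside
    ("truck", (2, 1, 3, 2)),   # inside, but not a watched class
]
alerts = warnings(seg_map, detections)
```

Because the rule operates on outputs only, the segmentation and detection networks can be trained and swapped independently.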
- Research Article
- 10.1117/1.jei.35.3.031206
- Mar 23, 2026
- Journal of Electronic Imaging
- Nikola Pižurica + 3 more
Many industrial inspection tasks require 3D information in order to be solved properly. Some examples include object pose estimation with 6 degrees of freedom (6D), conformity checks and volumetric measurements. However, obtaining accurate 3D or depth information about a scene in an industrial environment is in itself a considerably challenging task, especially in the production of mechanical parts, where many reflective metallic surfaces are encountered. Reflections make it difficult or even impossible to employ specialized depth sensors such as LiDARs and stereo 3D cameras, or traditional depth estimation and 3D reconstruction algorithms. At the same time, the fact that industrial use cases often involve a very low volume of available data usually prohibits the training of custom deep learning models for depth estimation. As a solution, we demonstrate the viability of utilizing large, pre-trained foundational models for monocular depth estimation (such as Depth Anything variants) in industrial production. More specifically, we showcase successful applications of these models in the domain of 5-axis machining setup inspection. The proposed methodology enables adjusting the outputs of foundational depth models in a use case-specific setting, based on known values of 5-axis machine coordinates. Importantly, such adjustments are performed in a way that avoids fine-tuning of neural network parameters, and therefore our methodology can be applied even under extreme data scarcity conditions, using fewer than 50 images. Moreover, we showcase two downstream applications of the proposed monocular depth estimation approach: depth-based semantic segmentation of objects of interest, and depth-based 6D pose estimation. 
It is shown that the absolute relative error (AbsRel) of our depth estimations can be as low as 2.54% on average, which in turn leads to very precise semantic segmentation (89.4% IoU) and object pose estimations (6.39 mm ADD for objects with dimensions 200 mm×200 mm×60 mm). Lastly, the proposed segmentation and pose estimation methods exhibit excellent generalization capabilities, maintaining strong accuracy even when tested on images of completely new setups.
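Two ingredients of the abstract above lend themselves to a short illustration: adjusting a depth model's output without touching its weights, and the AbsRel metric. The sketch assumes the adjustment is a global scale and shift fitted by least squares at points whose true depth is known (for example, from machine coordinates); the abstract does not say the authors use this exact form, so treat the fitting step and all sample values as hypothetical. AbsRel itself is the standard mean of |predicted - true| / true.

```python
def fit_scale_shift(relative, metric):
    """Least-squares a, b such that a * relative + b approximates metric depth."""
    n = len(relative)
    mean_x = sum(relative) / n
    mean_y = sum(metric) / n
    sxx = sum((x - mean_x) ** 2 for x in relative)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(relative, metric))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def abs_rel(pred, truth):
    """Absolute relative error: mean of |pred - truth| / truth."""
    return sum(abs(p - t) / t for p, t in zip(pred, truth)) / len(truth)


# Reference points with known depth in mm (hypothetical values).
a, b = fit_scale_shift([0.0, 0.5, 1.0], [100.0, 200.0, 300.0])
aligned = [a * r + b for r in [0.25, 0.75]]

error = abs_rel([110.0, 190.0], [100.0, 200.0])  # (0.10 + 0.05) / 2
```

Only two scalars are estimated in this form, which is one way an adjustment can remain feasible with very few images.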
- Research Article
- 10.3390/s26061949
- Mar 20, 2026
- Sensors (Basel, Switzerland)
- Jiantao Yang + 1 more
Segmenting non-rigid objects such as smoke in video requires effective utilization of temporal information, which remains challenging due to their irregular deformation and complex appearance variations. Based on our previously proposed DeffNet for industrial fumes video segmentation, this letter presents a novel adaptive frame selection algorithm that employs fuzzy logic control to dynamically optimize the temporal processing step size for the specific task of industrial smoke video segmentation. Our method quantifies inter-frame variation using the Structural Similarity Index (SSIM) and Normalized Cross-Correlation (NCC) as inputs to a fuzzy inference system. Gaussian membership functions, shaped via K-means clustering, and a five-rule fuzzy system are designed to determine the optimal step size, maximizing informative dynamic feature extraction while minimizing redundant computation. As a lightweight front-end module, the algorithm integrates seamlessly into the existing DeffNet segmentation framework without requiring a new network architecture. Extensive experiments on a dedicated industrial smoke video dataset demonstrate that our approach effectively improves the segmentation performance of DeffNet, achieving 84.27% Intersection over Union (IoU) while maintaining a high inference speed of 39.71 FPS. This work provides an efficient and scene-specific solution for temporal modeling in industrial smoke non-rigid object segmentation and offers a practical improvement strategy for DeffNet in real-time industrial smoke monitoring.
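Two building blocks named in the abstract above are easy to show in isolation: an inter-frame similarity score and a Gaussian membership function that maps the score to a fuzzy degree. The sketch uses the zero-mean variant of normalized cross-correlation, which is one common definition and may differ from the authors' choice. The membership centre and width below are hypothetical; in the paper they are shaped via K-means clustering, and the five fuzzy rules are not reproduced here.

```python
import math


def ncc(frame_a, frame_b):
    """Zero-mean normalized cross-correlation of two flat intensity sequences."""
    n = len(frame_a)
    mean_a = sum(frame_a) / n
    mean_b = sum(frame_b) / n
    numerator = sum((x - mean_a) * (y - mean_b) for x, y in zip(frame_a, frame_b))
    denominator = math.sqrt(
        sum((x - mean_a) ** 2 for x in frame_a) * sum((y - mean_b) ** 2 for y in frame_b)
    )
    return numerator / denominator if denominator else 0.0


def gaussian_membership(x, center, sigma):
    """Degree (0 to 1) to which x belongs to a fuzzy set centred at `center`."""
    return math.exp(-((x - center) ** 2) / (2 * sigma ** 2))


# Two tiny "frames" (hypothetical intensities) and a hypothetical
# "high similarity" fuzzy set centred at 0.9 with sigma 0.1.
similarity = ncc([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
high_similarity = gaussian_membership(similarity, center=0.9, sigma=0.1)
```

A high membership in a "frames are similar" set would argue for a larger temporal step, since consecutive frames carry little new information.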
- Research Article
- 10.55592/cilamce2025.v5i.14453
- Mar 18, 2026
- Ibero-Latin American Congress on Computational Methods in Engineering (CILAMCE)
- Marcos Antonio Nunes Da Silveira + 1 more
This paper proposes a computer vision system for active monitoring of logistics operations, focusing on detecting and mitigating risks, such as interactions between people and vehicles. The main objective is to increase safety in the operational environment. IP cameras with RTSP transmission are used. Images are processed remotely on a notebook connected to the Wi-Fi network, while visual alerts are generated locally on the identified target. The use of color-coded segmentation in HSV space allows dynamically defining regions of interest (ROIs), with real-time adjustments using trackbars. Segmented objects have their contours detected and receive blue ROIs with configurable offsets. Target classification as "person" or "vehicle" is performed by a Convolutional Neural Network (CNN) with the YOLOv8n model. The preprocessing of the image dataset used the transfer learning technique via Roboflow, using 1,500 images. Data augmentation was applied, generating a final dataset of 3,612 images. The system was implemented in Python, using OpenCV and integration with ESP32 via HTML requests, triggering an alert LED. The combination of color segmentation in HSV space and CNN offers an accurate, efficient, and low-cost solution compared to traditional methods.
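The HSV colour segmentation and offset ROI described above can be sketched without external dependencies. The system in the abstract uses OpenCV; the version below instead uses the standard-library `colorsys` module, whose HSV components all lie in [0, 1], unlike OpenCV's 8-bit ranges. The threshold bounds, function names, and the tiny image are hypothetical.

```python
import colorsys


def hsv_mask(pixels, lower, upper):
    """Boolean mask of pixels whose HSV value lies within [lower, upper].

    pixels: rows of (r, g, b) tuples with components 0-255.
    lower, upper: (h, s, v) bounds, each component in [0, 1].
    """
    mask = []
    for row in pixels:
        mask_row = []
        for r, g, b in row:
            hsv = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            mask_row.append(all(lo <= c <= hi for c, lo, hi in zip(hsv, lower, upper)))
        mask.append(mask_row)
    return mask


def bounding_roi(mask, offset=0):
    """Bounding box (x1, y1, x2, y2) of True pixels, grown by `offset` and clamped."""
    coords = [(x, y) for y, row in enumerate(mask) for x, on in enumerate(row) if on]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    height, width = len(mask), len(mask[0])
    return (
        max(min(xs) - offset, 0),
        max(min(ys) - offset, 0),
        min(max(xs) + offset, width - 1),
        min(max(ys) + offset, height - 1),
    )


# One-row image: a pure blue pixel next to a pure red one.
image = [[(0, 0, 255), (255, 0, 0)]]
blue_mask = hsv_mask(image, lower=(0.55, 0.5, 0.5), upper=(0.75, 1.0, 1.0))
roi = bounding_roi(blue_mask, offset=1)
```

In the described system the lower and upper bounds would be the values adjusted live through trackbars.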
- Research Article
- 10.1038/s41598-026-44542-0
- Mar 18, 2026
- Scientific reports
- N Deluxni + 4 more
The rapid accumulation of marine debris poses a substantial threat to ocean ecosystems, specifically through the degradation of plastics, composite materials, and metals. Effective detection and classification of debris by material type are essential during waste management and material recycling. However, underwater debris detection is often hampered by low contrast, color distortion, and noise in underwater imagery. Various deep learning models have been proposed in the literature for debris detection; however, most of them suffer from high real-time implementation complexity and inaccurate instance segmentation. This work proposes an adaptive hybrid lightweight Mask R-CNN system including image augmentation, object identification, and real-time instance segmentation to handle these issues. The preprocessed images are fed to the lightweight Mask R-CNN model for object detection and segmentation. The proposed model uses an upgraded Region Proposal Network (RPN) for exact localization of underwater trash and MobileNetV3 as a lightweight backbone for effective feature representation. The model uses data augmentation methods including contrast correction, flipping, and blurring to improve robustness; the model is also trained on proprietary underwater debris datasets. Compared with conventional approaches, performance evaluation using Mean Average Precision (mAP), Structural Similarity Index Measure (SSIM), Intersection over Union (IoU), and Peak Signal to Noise Ratio (PSNR) shows better accuracy. Furthermore, the model runs at 30 FPS, which makes it highly suitable for real-time operation.
- Research Article
- 10.1186/s12903-026-08097-w
- Mar 17, 2026
- BMC oral health
- Esra Ozcelik + 4 more
This study aimed to evaluate and compare the performance of state-of-the-art deep learning-based object detection and segmentation architectures (YOLOv8, YOLOv11, Mask R-CNN and DeepLabV3) for automated tooth detection and numbering in children aged 6 to 12 years on panoramic radiographs. A total of 1,378 anonymized panoramic images were retrospectively obtained and annotated using the FDI numbering system with polygon labeling. The dataset was stratified into age groups (6–12 years) to assess age-specific performance. All tested models (YOLOv8, YOLOv11, Mask R-CNN and DeepLabV3) were trained and evaluated in two scenarios: (1) overall detection performance without age separation and (2) age-based analysis. Evaluation metrics included Precision, Recall, F1 Score, mAP50, and mAP50-95. In Scenario 1, YOLOv11 achieved higher scores across all metrics compared to YOLOv8, including Precision (0.8435), Recall (0.8755), F1 Score (0.8592), mAP50 (0.8715), and mAP50-95 (0.5613). Scenario 2 revealed performance variations across age groups, with YOLOv11 consistently outperforming YOLOv8. The highest performance was recorded at age 12 with YOLOv11, achieving 0.9657 F1 Score and 0.9817 mAP50, indicating enhanced accuracy in older children with more stable dentition. YOLOv11 demonstrated superior capability in detecting and numbering teeth on pediatric panoramic radiographs, particularly in older age groups. These findings support the potential of advanced YOLO-based models as promising decision-support tools for tasks such as standardized charting and tooth identification during the mixed dentition period.
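The Scenario 1 figures above can be cross-checked, since F1 is the harmonic mean of precision and recall. The sketch below uses the standard definitions; the true-positive, false-positive, and false-negative counts in the second half are hypothetical and serve only to show how precision and recall arise from matches.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Reported YOLOv11 precision and recall from Scenario 1.
reported_f1 = f1_score(0.8435, 0.8755)  # rounds to 0.8592, matching the abstract

# Hypothetical counts: 8 correct detections, 2 spurious, 2 missed.
p, r = precision_recall(tp=8, fp=2, fn=2)
```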
- Research Article
- 10.1364/ao.583911
- Mar 10, 2026
- Applied optics
- Jie Shen + 5 more
Underwater images often exhibit color distortion and low illumination because of the complex imaging mechanism of underwater scenes. These issues can significantly hinder the performance of underwater vision applications, including object segmentation and detection. To address these limitations, we propose FPSANet, a feature pyramid serial attention network designed for underwater image enhancement. This framework leverages multiscale feature fusion and an advanced attention mechanism. First, we propose a feature pyramid fusion module to integrate spatial information across multiple scales. Second, we design a serial attention module (SAM) that prioritizes illumination features and emphasizes critical color details by combining pixel, channel, and spatial attention. Both qualitative analysis and quantitative evaluations show that our method excels across diverse underwater datasets: compared with the second-best method, it improves PSNR and SSIM by at least 3.23% and 1.46%, respectively. The experimental results highlight its effectiveness in tasks such as image segmentation, keypoint detection, and even the enhancement of foggy images.
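PSNR, one of the two enhancement metrics cited above, is 10 · log10(MAX² / MSE) between an enhanced image and its reference. The sketch below assumes 8-bit intensities (MAX = 255) and flattened images; the sample values are hypothetical, and SSIM is omitted because it needs windowed statistics.

```python
import math


def psnr(image, reference, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat intensity sequences."""
    mse = sum((x - y) ** 2 for x, y in zip(image, reference)) / len(image)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


# Two-pixel toy images: MSE is 255**2 / 2, so the ratio inside the log is 2.
value = psnr([255.0, 0.0], [0.0, 0.0])
```

Because PSNR is logarithmic, a percentage gain in PSNR such as the one reported does not correspond to the same percentage reduction in mean squared error.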
- Research Article
- 10.1108/dta-08-2025-0745
- Mar 2, 2026
- Data Technologies and Applications
- Quang Toai Ton + 4 more
Purpose: Object detection and instance segmentation play an important role in autonomous driving, where vehicles must perceive their surroundings reliably. In practice, these tasks are commonly addressed using separate models, which increases both training complexity and deployment cost. To overcome this issue, we propose UniPercepNet-S, a lightweight dual-task framework inspired by YOLOF that brings detection and segmentation into a single unified network, aiming to support real-time perception in resource-constrained environments. Design/methodology/approach: UniPercepNet-S follows a YOLOF-style one-level detection design and strengthens the backbone with a channel attention module to improve feature quality. To enable instance segmentation, we add a simple yet efficient mask prediction branch that operates directly on detected objects while keeping computation low. We evaluate the proposed framework on MS COCO and BDD100K, covering both general object segmentation and autonomous-driving-oriented scenarios. Findings: The proposed UniPercepNet-S achieves a mask AP of 38.0 on MS COCO, placing it among the top-performing entries in the COCO Detection Challenge for segmentation tasks. On BDD100K, which reflects real-world driving conditions, the model reaches an AP of 20.3, showing that it generalizes well across different datasets. These results suggest that UniPercepNet-S can deliver accurate detection and segmentation while remaining suitable for real-time use. Originality/value: This work contributes a unified and lightweight one-level framework that performs object detection and instance segmentation simultaneously, avoiding the need for heavy multi-scale architectures or separate task-specific models. By combining attention-enhanced representations with an efficient segmentation branch, UniPercepNet-S provides a practical solution for real-time perception. 
Its balance between simplicity, accuracy, and speed makes it especially valuable for autonomous driving and other embedded vision applications.