A Transformer-Based Hierarchical Hybrid Encoder Network for Semantic Segmentation
In semantic segmentation, the limited receptive field of convolutional neural networks leads to insufficient extraction of global features, which reduces segmentation accuracy. To address this issue, a Transformer-based Hierarchical Hybrid Encoder Network (HHEnet) is proposed for semantic segmentation. First, to compensate for the limited global feature information caused by the network's restricted receptive field, a Hierarchical Hybrid Encoder (HHE) is introduced, consisting of a Hierarchical Convolutional Encoder (HCE) and a Hierarchical Transformer Encoder (HTE). The encoder combines the advantages of convolution and Transformers, allowing effective extraction of both shallow and deep features. Second, to further enhance spatial and global semantic information, a Feature Enhancement Module (FEM) is introduced. It consists of a spatial feature enhancement module (SEM) and a global feature enhancement module (GEM), which enhance spatial detail information and global semantic information, respectively, thereby improving segmentation accuracy. Finally, to alleviate the discrepancy between the features of the convolutional encoder and the Transformer encoder, a Feature Guidance Module (FGM) is introduced. Experiments on the Cityscapes, ADE20K, and PASCAL VOC2012 datasets yield mIoU scores of up to 81.9%, 49.4%, and 79.1%, respectively. Compared with state-of-the-art networks, these results confirm the higher segmentation accuracy of the proposed HHEnet.
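The mIoU figures quoted above follow the standard definition: per-class intersection over union, averaged over classes. A minimal sketch of that metric is given here for orientation only. The function name `mean_iou` and the toy confusion matrix are illustrative assumptions and are not taken from the HHEnet paper.

```python
def mean_iou(conf):
    """Mean intersection over union from a square confusion matrix.

    conf[i][j] = number of pixels whose ground-truth class is i
    and whose predicted class is j.
    """
    n = len(conf)
    ious = []
    for c in range(n):
        tp = conf[c][c]                                   # correctly labeled pixels
        fn = sum(conf[c]) - tp                            # class c pixels that were missed
        fp = sum(conf[r][c] for r in range(n)) - tp       # pixels wrongly labeled as c
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both prediction and ground truth
            ious.append(tp / denom)
    return sum(ious) / len(ious)


# Toy 3-class confusion matrix (made-up numbers, for illustration only).
toy_confusion = [
    [50, 2, 3],
    [4, 40, 1],
    [6, 0, 30],
]
toy_miou = mean_iou(toy_confusion)
print(f"toy mIoU = {toy_miou:.4f}")
```

Benchmark toolkits typically accumulate the confusion matrix over the whole validation set before averaging, which is the convention this sketch assumes.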
- Research Article
4
- 10.1109/tce.2025.3557449
- May 1, 2025
- IEEE Transactions on Consumer Electronics
In recent years, significant progress has been made in crowd counting with the development of convolutional neural networks (CNNs). However, while CNNs excel at extracting local features, their limited receptive fields restrict their ability to model global context. In contrast, Transformers can effectively model long-distance dependencies but are inferior to CNNs in capturing local detail features. Both local details and global context are crucial for handling large scale variations in crowds. To address this problem, we propose a novel dual backbone network (DBNet) that integrates CNN and Transformer architectures, aiming to capture and aggregate both global semantic information and local detail features at multiple levels. Specifically, the dual backbone structure is designed to extract fine-grained local features while modeling long-range contextual relationships. Additionally, we introduce a multi-attention hierarchical fusion module (MAHF) that integrates global and local features from the two backbones while suppressing background noise. To further enhance accuracy in the presence of multi-scale variations, we also employ a Feature Enhancement Module (FEM), which enables the network to identify edge features more effectively and facilitates multi-scale feature modeling. Extensive experiments on the ShanghaiTech, UCF-QNRF, and JHU-Crowd++ datasets demonstrate that DBNet achieves competitive performance, validating the effectiveness of our approach.
- Research Article
9
- 10.1016/j.dsp.2024.104769
- Sep 11, 2024
- Digital Signal Processing
RTIA-Mono: Real-time lightweight self-supervised monocular depth estimation with global-local information aggregation
- Research Article
2
- 10.1016/j.asr.2024.06.056
- Jun 25, 2024
- Advances in Space Research
ST-MDAMNet: Swin transformer combines multi-dimensional attention mechanism for semantic segmentation of high-resolution earth surface images
- Research Article
- 10.1109/jstars.2025.3644588
- Jan 1, 2025
- IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing
In recent years, the growth of multimodal image data has opened broader prospects for multimodal semantic segmentation. However, the data heterogeneity between different modalities makes it difficult to leverage complementary information and creates deviations in semantic understanding, which limits fusion quality and segmentation accuracy. To overcome these challenges, we propose a hybrid attention driven CNN-Mamba multimodal fusion network (HACMNet) for semantic segmentation. It aims to fully exploit the strengths of optical images in texture and semantic representation, along with the complementary structural and elevation information from the digital surface model (DSM). This enables the effective extraction and combination of global and local complementary information to achieve higher accuracy and robustness in semantic segmentation. Specifically, we first propose a progressive cross-modal feature interaction mechanism (PCMFI) in the encoder. It integrates the fine-grained textures and semantic information of optical images with the structural boundaries and spatial information of the DSM, thereby facilitating more precise cross-modal feature interaction. Second, we design an adaptive dual-stream Mamba cross-modal fusion module (ADMCF), which leverages a learnable variable mechanism to deeply represent global semantic and spatial structural information. This enhances deep semantic feature interaction and improves the ability of the model to distinguish complex land cover categories. Together, these modules progressively refine cross-modal cues and strengthen semantic interactions, enabling more coherent and discriminative multimodal fusion. Finally, we introduce a global-local feature decoder to effectively integrate the global and local information from the fused multimodal features. It preserves the structural integrity of target objects while enhancing edge detail representation, thus improving segmentation results.
Through rigorous testing on standard datasets like ISPRS Vaihingen and Potsdam, the proposed HACMNet demonstrates advantages over prevailing methods in multimodal remote sensing analysis, particularly on challenging object classes.
- Research Article
1
- 10.3390/app14051777
- Feb 22, 2024
- Applied Sciences
Semantic segmentation of 3D point clouds in drivable areas is very important for unmanned vehicles. Due to the imbalance between the size of various outdoor scene objects and the sample size, object boundaries are unclear and the features of small-sample objects cannot be extracted. As a result, the semantic segmentation accuracy of 3D point clouds in outdoor environments is low. To solve these problems, we propose a local dual-enhancement network (LDE-Net) for semantic segmentation of 3D point clouds in outdoor environments for unmanned vehicles. The network is composed of local-global feature extraction modules and a local feature aggregation classifier. The local-global feature extraction module captures both local and global features, which can improve the accuracy and robustness of semantic segmentation. The local feature aggregation classifier considers the feature information of neighboring points to ensure clear object boundaries and high overall segmentation accuracy. Experimental results show that LDE-Net provides clearer boundaries between objects and identifies small-sample objects more accurately. LDE-Net performs well for semantic segmentation of 3D point clouds in outdoor environments.
- Research Article
8
- 10.3390/s23135909
- Jun 26, 2023
- Sensors
Pixel-level information of remote sensing images is of great value in many fields. CNN has a strong ability to extract image backbone features, but due to the localization of convolution operation, it is challenging to directly obtain global feature information and contextual semantic interaction, which makes it difficult for a pure CNN model to obtain higher precision results in semantic segmentation of remote sensing images. Inspired by the Swin Transformer with global feature coding capability, we design a two-branch multi-scale semantic segmentation network (TMNet) for remote sensing images. The network adopts the structure of a double encoder and a decoder. The Swin Transformer is used to increase the ability to extract global feature information. A multi-scale feature fusion module (MFM) is designed to merge shallow spatial features from images of different scales into deep features. In addition, the feature enhancement module (FEM) and channel enhancement module (CEM) are proposed and added to the dual encoder to enhance the feature extraction. Experiments were conducted on the WHDLD and Potsdam datasets to verify the excellent performance of TMNet.
- Conference Article
- 10.1109/radar53847.2021.10028242
- Dec 15, 2021
The detection performance of small vessels in nearshore SAR images is easily affected by land false alarms. Therefore, sea-land segmentation is usually performed before vessel detection. However, this requires separate training and inference for the segmentation and detection tasks, which is time-consuming. Moreover, the two tasks do not benefit each other by sharing global semantic and feature information. In this paper, a single encoder-decoder network for both sea-land segmentation and small vessel detection is proposed. The global semantic information of sea-land segmentation is introduced into the optimization of the small vessel detection model by back-propagating the joint sea-land segmentation loss and vessel detection loss. Considering the sampling differences between sea-land segmentation and small vessel detection, a dilate-encoder module is added to the sea-land segmentation layer to extract global semantic information and improve sea-land segmentation performance. Experimental results on the Sentinel-1 data set verify that the proposed method not only reduces missed detections and land false alarms but also improves the accuracy of sea-land segmentation.
- Research Article
4
- 10.1016/j.ecoinf.2025.103029
- May 1, 2025
- Ecological Informatics
Multiscale feature fusion and enhancement in a transformer for the fine-grained visual classification of tree species
- Research Article
15
- 10.1080/01431161.2023.2264502
- Oct 2, 2023
- International Journal of Remote Sensing
Target segmentation of remote sensing images has long been an active topic in image processing. This paper proposes a new semantic segmentation method for remote sensing images, which uses Unet as the backbone and combines an attention mechanism with a feature enhancement module. The feature enhancement module amplifies the information in the region of interest (ROI) to improve image contrast; the attention mechanism includes spatial and channel attention modules, which capture more detail of the desired target while suppressing irrelevant information. This paper also improves the loss function of the traditional Unet: a mean squared logarithmic error term is added to the sparse categorical cross-entropy function, which effectively improves the accuracy of semantic segmentation. The experimental results show that the algorithm achieves higher accuracy than Unet, DeepLabV3, SegNet, PSPNet, CBAM and DAnet while matching the computational speed of FCN and Unet in model testing and validation.
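The combined loss described in this abstract can be sketched for a single pixel as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the two terms are weighted or what the logarithmic error is computed against, so the `msle_weight` parameter and the choice of a one-hot target are hypothetical.

```python
import math


def sparse_cce(probs, label, eps=1e-7):
    """Sparse categorical cross-entropy for one pixel.

    probs: predicted class probabilities; label: integer class index.
    """
    p = min(max(probs[label], eps), 1.0)  # clamp to avoid log(0)
    return -math.log(p)


def msle(probs, label):
    """Mean squared logarithmic error against the one-hot target (assumed form)."""
    total = 0.0
    for k, p in enumerate(probs):
        target = 1.0 if k == label else 0.0
        total += (math.log1p(target) - math.log1p(p)) ** 2
    return total / len(probs)


def combined_loss(probs, label, msle_weight=1.0):
    """Cross-entropy plus a weighted MSLE term; msle_weight is an assumption."""
    return sparse_cce(probs, label) + msle_weight * msle(probs, label)


good = combined_loss([0.1, 0.8, 0.1], label=1)
bad = combined_loss([0.6, 0.3, 0.1], label=1)
print(f"confident correct: {good:.4f}, mostly wrong: {bad:.4f}")
```

In a real training loop the per-pixel values would be averaged over the image and the batch. The sketch only shows how the two terms combine.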
- Research Article
30
- 10.3389/fnins.2019.01128
- Oct 24, 2019
- Frontiers in Neuroscience
Background: The purpose of the present study was to evaluate deep learning-based image-guided surgical planning for deep brain stimulation (DBS). We developed deep learning semantic segmentation-based DBS targeting and prospectively applied the method clinically. Methods: T2* fast gradient-echo images from 102 patients were used for training and validation. Manually drawn ground truth information was prepared for the subthalamic and red nuclei with an axial cut ∼4 mm below the anterior–posterior commissure line. A fully convolutional neural network (FCN-VGG-16) was used to ensure margin identification by semantic segmentation. Image contrast augmentation was performed nine times. Up to 102 original images and 918 augmented images were used for training and validation. The accuracy of semantic segmentation was measured in terms of mean accuracy and mean intersection over the union. Targets were calculated based on their relative distance from these segmented anatomical structures considering the Bejjani target. Results: Mean accuracies and mean intersection over the union values were high: 0.904 and 0.813, respectively, for the 62 training images, and 0.911 and 0.821, respectively, for the 558 augmented training images when 360 augmented validation images were used. The Dice coefficient converted from the intersection over the union was 0.902 when 720 training and 198 validation images were used. Semantic segmentation was adaptive to high anatomical variations in size, shape, and asymmetry. For clinical application, two patients were assessed: one with essential tremor and another with bradykinesia and gait disturbance due to Parkinson's disease. Both improved without complications after surgery, and microelectrode recordings showed subthalamic nuclei signals in the latter patient. Conclusion: The accuracy of deep learning-based semantic segmentation may surpass that of previous methods. DBS targeting and its clinical application were made possible using accurate deep learning-based semantic segmentation, which is adaptive to anatomical variations.
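The abstract reports a Dice coefficient "converted from the intersection over the union". For a single binary mask the two metrics are related by Dice = 2·IoU / (1 + IoU), which the short sketch below works through. The function names are illustrative. The abstract does not state the IoU for the 720/198 image split, so the value 0.821 is used here only as an arithmetic example and is not a claim about which configuration produced 0.902.

```python
def dice_from_iou(iou):
    """Convert intersection over union to the Dice coefficient."""
    return 2.0 * iou / (1.0 + iou)


def iou_from_dice(dice):
    """Inverse conversion, from Dice back to intersection over union."""
    return dice / (2.0 - dice)


example_dice = dice_from_iou(0.821)
print(f"IoU 0.821 -> Dice {example_dice:.3f}")
```

Because the mapping is monotonic, ranking models by Dice or by IoU gives the same order. Dice is always the larger of the two for values strictly between 0 and 1.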
- Preprint Article
- 10.21203/rs.3.rs-6535669/v1
- Jun 9, 2025
- Research Square
Medical image segmentation is a fundamental task in medical image analysis, playing a crucial role in disease diagnosis, treatment planning, and clinical decision-making. Accurate segmentation of anatomical structures, such as blood vessels, organs, and lesions, is essential for reliable medical interpretations. However, existing segmentation models often face challenges in effectively capturing both local and global features within medical images. Traditional convolutional neural networks (CNNs) excel at extracting fine-grained local features but struggle to model long-range dependencies due to their limited receptive fields. Conversely, transformer-based models can capture global contextual relationships but often fail to preserve detailed local structures, leading to suboptimal segmentation performance. The purpose of this study is to develop SEFormer, a novel hybrid network that integrates SENet, ResNet, and Transformer to enhance medical image segmentation. SEFormer aims to effectively capture both local and global features, addressing the limitations of traditional CNNs and transformer-based models. SENet is incorporated to recalibrate feature maps by adaptively emphasizing informative features, thereby compensating for the global representation limitations of CNNs. Meanwhile, the ResNet backbone ensures deep feature extraction while maintaining computational efficiency. Additionally, the Transformer module is introduced to capture long-range dependencies and enhance contextual awareness, complementing the local feature extraction capabilities of CNNs. This hybrid approach allows SEFormer to balance fine-grained spatial details with broader contextual information, leading to more precise segmentation results. Furthermore, to mitigate information loss during feature extraction and ensure a more complete and hierarchical representation of image information, we introduce a feature pyramid structure inspired by multi-scale image pyramid models. 
By progressively increasing the receptive field across different scales and employing skip connections, SEFormer effectively fuses multi-scale local and global features at each stage of the pyramid. This hierarchical fusion mechanism ensures a richer and more robust feature representation, which is particularly beneficial for segmenting complex medical structures such as blood vessels. We evaluate SEFormer on the CHASE_DB1 dataset, a widely used benchmark for retinal vessel segmentation. Experimental results demonstrate that SEFormer outperforms existing state-of-the-art segmentation methods, achieving a 3.25% improvement in segmentation accuracy. Additionally, we conduct ablation studies to verify the contributions of different components within SEFormer. The results show that incorporating SENet enhances feature recalibration and channel attention, while the Transformer module significantly improves global context awareness. The feature pyramid structure further contributes to performance gains by ensuring a multi-scale representation of vascular structures. Compared to conventional CNN-based methods, our approach achieves better segmentation accuracy with improved robustness to variations in vessel thickness, noise, and image contrast. Furthermore, SEFormer maintains computational efficiency, making it suitable for real-world clinical applications where both accuracy and processing speed are critical. SEFormer provides an efficient and accurate approach for medical image segmentation by leveraging the complementary strengths of CNNs, SENet, and Transformers while integrating a feature pyramid structure to maximize feature representation. Future research directions include extending SEFormer to other medical imaging modalities, such as CT and MRI scans, as well as exploring further optimization techniques to reduce computational costs while maintaining high segmentation performance. 
Additionally, integrating SEFormer with active learning frameworks could further improve segmentation performance by leveraging expert-annotated data more efficiently.
- Book Chapter
- 10.1007/978-981-19-6613-2_356
- Jan 1, 2023
Class imbalance in remote sensing image datasets lowers semantic segmentation accuracy and leads to poor segmentation of scarce classes. To address these problems, a Transformer encoder with a multi-head self-attention mechanism is integrated into the Deeplabv3+ network, and the attention mechanism is used to enhance the ability to capture global information, so as to improve precision on scarce categories in remote sensing semantic segmentation. First, the algorithm uniformly crops high-resolution remote sensing images into lower-resolution images for batch training and to reduce information loss. Second, MobileNetV2, a lightweight backbone feature extraction network, replaces the Xception network of Deeplabv3+. Finally, the Transformer encoder based on multi-head self-attention is connected in series with the atrous spatial pyramid pooling (ASPP) module in the Deeplabv3+ encoder to increase segmentation precision by enhancing feature learning for scarce category samples in the deep features. The experimental results show that the proposed model, called TransDeeplabv3+, achieves 67.67% mIoU and 81.86% mPA on the GID remote sensing dataset. Compared with the Deeplabv3+ model, mIoU and mPA are improved by 9.46% and 8.52%, respectively. TransDeeplabv3+ effectively increases precision by paying more attention to scarce class samples, and it mitigates the drop in segmentation accuracy caused by unbalanced data categories.
Keywords: Semantic segmentation; Deeplabv3+; Transformer; Attention mechanism; Remote sensing image
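The first step above, uniformly cropping a large image into fixed-size tiles for batch training, can be sketched as a function that returns crop boxes. This is a hypothetical helper and not code from the chapter. The name `tile_boxes`, the `stride` parameter, and the choice to shift the last tile back so the border is covered are all assumptions made for illustration.

```python
def tile_boxes(height, width, tile, stride=None):
    """Return (top, left, bottom, right) crop boxes that cover an image.

    Tiles are placed on a regular grid. If the image size is not a multiple
    of the stride, the last tile in that direction is shifted back so that it
    ends exactly at the image border (assumed behavior, so no pixels are lost).
    """
    stride = stride or tile

    def starts(size):
        if size <= tile:
            return [0]
        positions = list(range(0, size - tile + 1, stride))
        if positions[-1] != size - tile:
            positions.append(size - tile)  # extra tile flush with the border
        return positions

    return [
        (top, left, min(top + tile, height), min(left + tile, width))
        for top in starts(height)
        for left in starts(width)
    ]


boxes = tile_boxes(1024, 1024, tile=512)
print(len(boxes), boxes[0], boxes[-1])
```

Each box can then be used to slice both the image and its label mask, so that every tile stays aligned with its ground truth.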
- Research Article
- 10.53106/199115992022043302004
- Apr 1, 2022
- 電腦學刊
Machine translation is currently an active research topic. Traditional machine translation methods are not effective because they require a large number of training samples. Visual semantic information from images can improve text machine translation models. Most existing works fuse the visual semantic information of the whole image into the translation model, but an image may contain several different semantic objects, and these local semantic objects affect the decoder's word prediction differently. Therefore, this paper proposes a multi-modal machine translation model based on an image visual attention mechanism that fuses global and local semantic information. The global and local semantic information in the image are fused into the text attention weight as the image attention. Thus, the alignment information between the hidden state of the decoder and the source-language text is further enhanced. Experimental results on the English-German and Indonesian-Chinese translation pairs of the Multi30K dataset show that the proposed model outperforms state-of-the-art multi-modal machine translation models. The BLEU values for English-German and Indonesian-Chinese translation exceed 43% and 29%, respectively, which demonstrates the effectiveness of the proposed model.
- Research Article
158
- 10.1111/mice.13003
- Apr 4, 2023
- Computer-Aided Civil and Infrastructure Engineering
Hybrid semantic segmentation for tunnel lining cracks based on Swin Transformer and convolutional neural network
- Research Article
5
- 10.3390/rs15102580
- May 15, 2023
- Remote Sensing
Thanks to the powerful feature extraction and representation capabilities of deep learning (DL), aerial image semantic segmentation based on deep neural networks (DNNs) has achieved remarkable success in recent years. Nevertheless, the security and robustness of DNNs deserve attention when dealing with safety-critical earth observation tasks. As a typical attack pattern in adversarial machine learning (AML), backdoor attacks embed hidden triggers in DNNs by poisoning the training data. The attacked DNNs behave normally on benign samples, but when the hidden trigger is activated, their predictions are changed to a specified target label. In this article, we systematically assess the threat of backdoor attacks to aerial image semantic segmentation tasks. To defend against backdoor attacks and maintain better semantic segmentation accuracy, we construct a novel robust generative adversarial network (RFGAN). Motivated by the sensitivity of the human visual system to global and edge information in images, RFGAN includes a robust global feature extractor (RobGF) and a robust edge feature extractor (RobEF) that force DNNs to learn global and edge features. RFGAN then uses the robust global and edge features as guidance: the constructed generator produces benign samples, and the discriminator produces the semantic segmentation results. Our method is the first attempt to address the backdoor threat to aerial image semantic segmentation by constructing a robust DNN model architecture. Extensive experiments on real-world aerial image benchmark datasets demonstrate that RFGAN can effectively defend against backdoor attacks and achieve better semantic segmentation results than existing state-of-the-art methods.