DBENet: dual-branch encoder network based on the visual state space model for semantic segmentation of remote sensing image

Abstract

Semantic segmentation is a fundamental task for remote sensing images; however, jointly modelling global and local information remains challenging. Compared with the insufficient receptive field of traditional CNNs and the high computational complexity of Transformers, Mamba, built on the state space model (SSM), has emerged as an alternative that establishes long-distance dependencies while maintaining linear computational complexity. To better capture global and local information, we therefore propose a remote sensing semantic segmentation method based on the visual state space model, referred to as the dual-branch encoder network (DBENet), which employs a cross-scanning mechanism to compute attention. Specifically, we integrate ResNet and visual state space models as dual branches of the encoder to capture local and global features, respectively. In addition, we design a Global-Local Transformer Block (GLTB) and a Feature Enhancement Block (FEB) to enhance the global and local features. In the GLTB, the two inputs are fed into their respective enhancement components: one enhances the global features extracted by the VSS (visual state space) branch, while the other enhances the local features extracted by the ResNet branch. The FEB employs a 1 × 1 convolution to enhance features along the channel dimension. Extensive experiments on three public datasets, ISPRS Vaihingen, ISPRS Potsdam and LoveDA (Urban), demonstrate the effectiveness and potential of the proposed DBENet, which achieves better performance than state-of-the-art methods. The source code will be released after publication at https://github.com/MathMaths-liuh/MathMaths-liuh.
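Since the DBENet source code is not yet released, the following is a highly simplified PyTorch sketch of the dual-branch idea the abstract describes: a ResNet stage supplies local features, a stand-in "global" branch replaces the VSS/Mamba cross-scan (which is not reproduced here), a GLTB-like module lets each branch enhance the other, and the FEB is a 1 × 1 convolution over channels. All module names, the gating scheme, and the shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18  # torchvision assumed available


class FEB(nn.Module):
    """Feature Enhancement Block: 1x1 convolution along the channel dimension."""
    def __init__(self, ch: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=1),
            nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.conv(x)


class GLTBLike(nn.Module):
    """Cross-enhancement of the two branches: each branch gates the other."""
    def __init__(self, ch: int):
        super().__init__()
        self.gate_from_local = nn.Conv2d(ch, ch, kernel_size=1)
        self.gate_from_global = nn.Conv2d(ch, ch, kernel_size=1)

    def forward(self, local_feat, global_feat):
        g = global_feat * torch.sigmoid(self.gate_from_local(local_feat))   # local enhances global
        l = local_feat * torch.sigmoid(self.gate_from_global(global_feat))  # global enhances local
        return g + l


class DualBranchStage(nn.Module):
    """One encoder stage: ResNet branch (local) plus a stubbed 'global' branch."""
    def __init__(self, ch: int = 64):
        super().__init__()
        backbone = resnet18(weights=None)
        self.local_branch = nn.Sequential(
            backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool, backbone.layer1
        )
        self.global_branch = nn.Sequential(  # stand-in for a VSS/Mamba block
            nn.Conv2d(3, ch, kernel_size=7, stride=4, padding=3),
            nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
        )
        self.gltb = GLTBLike(ch)
        self.feb = FEB(ch)

    def forward(self, x):
        return self.feb(self.gltb(self.local_branch(x), self.global_branch(x)))


out = DualBranchStage()(torch.randn(1, 3, 256, 256))
print(out.shape)  # torch.Size([1, 64, 64, 64])
```

In the actual network the stand-in global branch would be a stack of VSS blocks with cross-scanning, and several such stages would feed a decoder.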

References (showing 10 of 22 papers)
  • Review of remote sensing methodologies for pavement management and assessment. E. Schnebele + 3 more. European Transport Research Review, Mar 7, 2015. doi:10.1007/s12544-015-0156-6. Cited 169 times.
  • Multistage Attention ResU-Net for Semantic Segmentation of Fine-Resolution Remote Sensing Images. Rui Li + 4 more. IEEE Geoscience and Remote Sensing Letters, Jan 1, 2022. doi:10.1109/lgrs.2021.3063381. Cited 156 times.
  • A Review of Remote Sensing for Environmental Monitoring in China. Jun Li + 5 more. Remote Sensing, Apr 2, 2020. doi:10.3390/rs12071130. Cited 219 times.
  • RS3Mamba: Visual State Space Model for Remote Sensing Image Semantic Segmentation. Xianping Ma + 2 more. IEEE Geoscience and Remote Sensing Letters, Jan 1, 2024. doi:10.1109/lgrs.2024.3414293. Cited 78 times.
  • Land cover classification from remote sensing images based on multi-scale fully convolutional network. Rui Li + 4 more. Geo-spatial Information Science, Jan 8, 2022. doi:10.1080/10095020.2021.2017237. Cited 104 times.
  • A2-FPN for semantic segmentation of fine-resolution remotely sensed images. Rui Li + 4 more. International Journal of Remote Sensing, Feb 1, 2022. doi:10.1080/01431161.2022.2030071. Cited 122 times.
  • Pan-Mamba: Effective pan-sharpening with state space model. Xuanhua He + 8 more. Information Fusion, Nov 8, 2024. doi:10.1016/j.inffus.2024.102779. Cited 40 times.
  • UNetMamba: An Efficient UNet-Like Mamba for Semantic Segmentation of High-Resolution Remote Sensing Images. Enze Zhu + 5 more. IEEE Geoscience and Remote Sensing Letters, Jan 1, 2025. doi:10.1109/lgrs.2024.3505193. Cited 7 times.
  • Central Attention Network for Hyperspectral Imagery Classification. Huan Liu + 5 more. IEEE Transactions on Neural Networks and Learning Systems, Nov 1, 2023. doi:10.1109/tnnls.2022.3155114. Cited 118 times.
  • DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. Liang-Chieh Chen + 4 more. IEEE Transactions on Pattern Analysis and Machine Intelligence, Apr 27, 2017. doi:10.1109/tpami.2017.2699184. Cited 18625 times.

Similar Papers
GLFFNet: Global–Local Feature Fusion Network for High-Resolution Remote Sensing Image Semantic Segmentation
  • Research Article. Remote Sensing, Mar 14, 2025. Saifeng Zhu + 4 more. doi:10.3390/rs17061019. Cited 1 time.

Although hybrid models based on convolutional neural network (CNN) and Transformer can extract features encompassing both global and local information, they still face two challenges in addressing the semantic segmentation task of high-resolution remote sensing (HR2S) images. First, they are limited by the loss of detailed information during encoding, resulting in inadequate utilization of features. Second, the ineffective fusion of local and global context information leads to unsatisfactory segmentation performance. To simultaneously address these two challenges, we propose a dual-branch network named global–local feature fusion network (GLFFNet) for HR2S image semantic segmentation. Specifically, we use the residual network (ResNet) as the main branch to extract local features. Recently, a Mamba architecture based on State Space Models has shown significant potential in image semantic segmentation tasks. Given that Mamba is capable of handling long-range relationships with linear computational complexity and relatively high speed, we introduce VMamba as an auxiliary branch encoder to provide global information for the main branch. Meanwhile, in order to utilize global information efficiently, we propose a multi-scale feature refinement (MSFR) module to reduce the loss of details during global feature extraction. Additionally, we develop a semantic bridging fusion (SBF) module to promote the full integration of global and local features, resulting in more comprehensive and refined feature representations. Comparative experiments on three public datasets demonstrate the segmentation accuracy and application potential of GLFFNet. Specifically, GLFFNet achieves mIoU scores of 84.01% on ISPRS Vaihingen, 87.54% on ISPRS Potsdam, and 54.73% on LoveDA, as well as mF1 scores of 91.11%, 93.23%, and 70.07% on these respective datasets.
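For context, mIoU and mF1 figures such as those reported above are typically computed from a per-class confusion matrix; the short NumPy sketch below shows the standard calculation on dummy predictions (class count and data are made up, and this is not GLFFNet's evaluation code).

```python
import numpy as np

def confusion_matrix(pred: np.ndarray, gt: np.ndarray, n_cls: int) -> np.ndarray:
    # rows = ground-truth class, columns = predicted class
    idx = gt.reshape(-1) * n_cls + pred.reshape(-1)
    return np.bincount(idx, minlength=n_cls * n_cls).reshape(n_cls, n_cls)

def miou_mf1(cm: np.ndarray, eps: float = 1e-7):
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp          # predicted as class c, but ground truth is another class
    fn = cm.sum(axis=1) - tp          # class c pixels that were missed
    iou = tp / (tp + fp + fn + eps)
    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    return iou.mean(), f1.mean()

rng = np.random.default_rng(0)
gt = rng.integers(0, 6, size=(256, 256))                       # 6 hypothetical classes
pred = np.where(rng.random((256, 256)) < 0.8, gt,
                rng.integers(0, 6, size=(256, 256)))           # 80% correct dummy prediction
miou, mf1 = miou_mf1(confusion_matrix(pred, gt, 6))
print(f"mIoU={miou:.3f}, mF1={mf1:.3f}")
```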

GLF-Net: A Semantic Segmentation Model Fusing Global and Local Features for High-Resolution Remote Sensing Images
  • Research Article. Remote Sensing, Sep 22, 2023. Wanying Song + 4 more. doi:10.3390/rs15194649. Cited 5 times.

Semantic segmentation of high-resolution remote sensing images holds paramount importance in the field of remote sensing. To better excavate and fully fuse the features in high-resolution remote sensing images, this paper introduces a novel Global and Local Feature Fusion Network, abbreviated as GLF-Net, by incorporating the extensive contextual information and refined fine-grained features. The proposed GLF-Net, devised as an encoder–decoder network, employs the powerful ResNet50 as its baseline model. It incorporates two pivotal components within the encoder phase: a Covariance Attention Module (CAM) and a Local Fine-Grained Extraction Module (LFM). And an additional wavelet self-attention module (WST) is integrated into the decoder stage. The CAM effectively extracts the features of different scales from various stages of the ResNet and then encodes them with graph convolutions. In this way, the proposed GLF-Net model can well capture the global contextual information with both universality and consistency. Additionally, the local feature extraction module refines the feature map by encoding the semantic and spatial information, thereby capturing the local fine-grained features in images. Furthermore, the WST maximizes the synergy between the high-frequency and the low-frequency information, facilitating the fusion of global and local features for better performance in semantic segmentation. The effectiveness of the proposed GLF-Net model is validated through experiments conducted on the ISPRS Potsdam and Vaihingen datasets. The results verify that it can greatly improve segmentation accuracy.

DiffMamba: semantic diffusion guided feature modeling network for semantic segmentation of remote sensing images
  • Research Article. GIScience & Remote Sensing, Apr 8, 2025. Zhen Wang + 3 more. doi:10.1080/15481603.2025.2484829.

With the rapid development of remote sensing technology, the application scope of high-resolution remote sensing images (HR-RSIs) has been continuously expanding. The emergence of convolutional neural networks and Transformer models has significantly enhanced the accuracy of semantic segmentation. However, these methods primarily focus on local feature extraction and long-range dependency modeling of global information, neglecting the spatial correlation of local features, which leads to poor segmentation of small-scale regions. To address this issue, based on Diffusion Model and State Space Model (SSM), we propose a semantic diffusion guided feature modeling network (DiffMamba) for HR-RSI semantic segmentation. DiffMamba uses a hybrid CNNs-Transformer as the encoder structure, and is equipped with the efficient phase sensing module (EPSM), the multi-view transformer module (MVTrans), the semantic diffusion alignment module (SDAM), and the coordinate state space model (CAMamba). EPSM focuses on enhancing local feature representation in the channel dimension, using the phase information of object region features to improve local information interaction and filter out clutter noise interference. MVTrans can observe the spatial location information of the object region from various perspectives to obtain refined global context details. SDAM utilizes the diffusion propagation process to fuse local and global information, alleviating the feature redundancy caused by semantic information differences. CAMamba employs state space transformation to construct the correlation of enhanced local features, and guides the model to achieve feature decoding to obtain refined semantic segmentation results. Extensive experiments on the widely used ISPRS 2-D Semantic Labeling dataset and the 15-Class Gaofen Image dataset confirm the superior efficiency of DiffMamba over several state-of-the-art methods.

An enhanced visual state space model for myocardial pathology segmentation in multi-sequence cardiac MRI
  • Research Article. Medical Physics, Mar 19, 2025. Shuning Li + 5 more. doi:10.1002/mp.17761.

Myocardial pathology (scar and edema) segmentation plays a crucial role in the diagnosis, treatment, and prognosis of myocardial infarction (MI). However, the current mainstream models for myocardial pathology segmentation have the following limitations when faced with cardiac magnetic resonance (CMR) images with multiple objects and large changes in object scale: the remote modeling ability of convolutional neural networks is insufficient, and the computational complexity of transformers is high, which makes myocardial pathology segmentation challenging. This study aims to develop a novel model that addresses the image characteristics and algorithmic challenges of the myocardial pathology segmentation task and improves its accuracy and efficiency. We developed a novel visual state space (VSS)-based deep neural network, MPS-Mamba. In order to accurately and adequately extract CMR image features, the encoder employs a dual-branch structure to extract global and local features of the image. The VSS branch overcomes the limitations of current mainstream models for myocardial pathology segmentation by modeling remote relationships with linear computational cost, while the convolution-based branch provides complementary local information. Given the distinct properties of the two branches, we design a modular dual-branch fusion module to fuse them and enhance the feature representation of the dual encoder. To improve the ability to model objects of different scales in CMR images, a multi-scale feature fusion (MSF) module is designed to achieve effective integration and fine expression of multi-scale information. To further incorporate anatomical knowledge into the segmentation results, a decoder with three decoding branches is designed to output segmentation results for scar, edema, and myocardium, respectively. In addition, multiple sets of constraint functions are used not only to improve the segmentation accuracy of myocardial pathology but also to effectively model the spatial relationship between myocardium, scar, and edema. The proposed method was comprehensively evaluated on the MyoPS 2020 dataset, and the results showed that MPS-Mamba achieved an average Dice score of 0.717 ± 0.169 in myocardial scar segmentation, which is superior to current mainstream methods. MPS-Mamba also performed well in the edema segmentation task, with an average Dice score of 0.735 ± 0.073. The experimental results further demonstrate the effectiveness of MPS-Mamba in segmenting myocardial pathologies in multi-sequence CMR images, verifying its advantages in myocardial pathology segmentation tasks. Given its effectiveness and superiority, MPS-Mamba is expected to become a potential myocardial pathology segmentation tool that can effectively assist clinical diagnosis.
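As a reference point for the numbers above, the Dice score can be computed as in the short NumPy sketch below; it uses dummy binary masks, and the paper's exact evaluation pipeline may differ.

```python
import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))

pred = np.zeros((64, 64), dtype=np.uint8); pred[16:40, 16:40] = 1   # dummy scar prediction
gt = np.zeros((64, 64), dtype=np.uint8); gt[20:44, 20:44] = 1       # dummy ground truth
print(round(dice_score(pred, gt), 3))
```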

Joint Coding of Local and Global Deep Features in Videos for Visual Search
  • Research Article. IEEE Transactions on Image Processing, Jan 1, 2020. Lin Ding + 4 more. doi:10.1109/tip.2020.2965306. Cited 24 times.

Practically, it is more feasible to collect compact visual features rather than the video streams from hundreds of thousands of cameras into the cloud for big data analysis and retrieval. Then the problem becomes which kinds of features should be extracted, compressed and transmitted so as to meet the requirements of various visual tasks. Recently, many studies have indicated that the activations from the convolutional layers in convolutional neural networks (CNNs) can be treated as local deep features describing particular details inside an image region, which are then aggregated (e.g., using Fisher Vectors) as a powerful global descriptor. Combination of local and global features can satisfy those various needs effectively. It has also been validated that, if only local deep features are coded and transmitted to the cloud while the global features are recovered using the decoded local features, the aggregated global features should be lossy and consequently would degrade the overall performance. Therefore, this paper proposes a joint coding framework for local and global deep features (DFJC) extracted from videos. In this framework, we introduce a coding scheme for real-valued local and global deep features with intra-frame lossy coding and inter-frame reference coding. The theoretical analysis is performed to understand how the number of inliers varies with the number of local features. Moreover, the inter-feature correlations are exploited in our framework. That is, local feature coding can be accelerated by making use of the frame types determined with global features, while the lossy global features aggregated with the decoded local features can be used as a reference for global feature coding. Extensive experimental results under three metrics show that our DFJC framework can significantly reduce the bitrate of local and global deep features from videos while maintaining the retrieval performance.

Automatic breast ultrasound (ABUS) tumor segmentation based on global and local feature fusion
  • Research Article. Physics in Medicine & Biology, May 30, 2024. Yanfeng Li + 5 more. doi:10.1088/1361-6560/ad4d53. Cited 2 times.

Accurate segmentation of tumor regions in automated breast ultrasound (ABUS) images is of paramount importance in computer-aided diagnosis systems. However, the inherent diversity of tumors and imaging interference pose great challenges to ABUS tumor segmentation. In this paper, we propose a global and local feature interaction model combined with graph fusion (GLGM) for 3D ABUS tumor segmentation. In GLGM, we construct a dual-branch encoder-decoder, where both local and global features can be extracted. Besides, a global and local feature fusion module is designed, which employs the deepest semantic interaction to facilitate information exchange between local and global features. Additionally, to improve the segmentation performance for small tumors, a graph convolution-based shallow feature fusion module is designed. It exploits the shallow feature to enhance the feature expression of small tumors in both local and global domains. The proposed method is evaluated on a private ABUS dataset and a public ABUS dataset. For the private ABUS dataset, small tumors (volume smaller than 1 cm³) account for over 50% of the entire dataset. Experimental results show that the proposed GLGM model outperforms several state-of-the-art segmentation models in 3D ABUS tumor segmentation, particularly in segmenting small tumors.

Human action recognition using bag of global and local Zernike moment features
  • Research Article. Multimedia Tools and Applications, May 15, 2019. Saleh Aly + 1 more. doi:10.1007/s11042-019-7674-5. Cited 25 times.

Human action recognition is a fundamental and challenging building block for many computer vision applications. It has been used in applications such as video surveillance, human-computer interaction and multimedia retrieval systems. Various approaches have been proposed to solve the human action recognition problem. Among them, moment-based methods are considered one of the simplest and most successful. However, moment-based methods take into consideration only global features while neglecting the discriminative properties of local features. In this paper, we propose a new efficient method that combines both Global and Local Zernike Moment (GLZM) features based on the Bag-of-Features (BoF) technique. Since global features alone are not sufficient to discriminate similar actions such as running, walking and jogging, augmenting them with localized features helps improve recognition accuracy. The proposed method first calculates local temporal Motion Energy Images (MEI) by accumulating frame differences over short runs of consecutive frames. Then, global and local features are calculated using Zernike moments with different polynomial orders to represent global and local motion patterns, respectively. Global features are calculated from the whole region of the human performing the action, while local features focus on localized regions of the human body in order to represent local motion information. Both local and global features are preprocessed using a whitening transformation; the bag-of-features algorithm then combines these pools of features and represents each action with a new GLZM feature descriptor. Finally, we use a multi-class Support Vector Machine (SVM) classifier to recognize human actions. To validate the proposed method, we perform a set of experiments using three publicly available datasets: Weizmann, KTH and UCF sports action. Experimental results using a leave-one-out strategy show that the proposed method achieves promising results compared with other state-of-the-art methods.
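A minimal scikit-learn sketch of the bag-of-features pipeline described above follows, with random placeholder descriptors standing in for the global and local Zernike moments; the array sizes, vocabulary size and class count are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_videos, n_local, dim = 40, 30, 25                     # hypothetical sizes
local_desc = rng.normal(size=(n_videos, n_local, dim))  # local Zernike-moment stand-ins
global_desc = rng.normal(size=(n_videos, dim))          # global Zernike-moment stand-ins
labels = rng.integers(0, 3, size=n_videos)              # 3 hypothetical action classes

# Build a visual vocabulary from all local descriptors, then histogram each video.
k = 16
codebook = KMeans(n_clusters=k, n_init=10, random_state=0).fit(local_desc.reshape(-1, dim))
hists = np.stack(
    [np.bincount(codebook.predict(v), minlength=k) for v in local_desc]
).astype(float)

# Concatenate the global moments with the local bag-of-features histogram, then classify.
features = np.hstack([global_desc, hists])
clf = SVC(kernel="linear").fit(features, labels)
print("training accuracy:", clf.score(features, labels))
```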

Author response: A connectomics-based taxonomy of mammals
  • Peer Review Report. Oct 10, 2022. Laura E Suarez + 6 more. doi:10.7554/elife.78635.sa2. Cited 1 time.

Samba: Semantic segmentation of remotely sensed images with state space model
  • Research Article. Heliyon, Sep 26, 2024. Qinfeng Zhu + 6 more. doi:10.1016/j.heliyon.2024.e38495. Cited 47 times.

High-resolution remotely sensed images pose challenges to traditional semantic segmentation networks, such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). CNN-based methods struggle to handle high-resolution images due to their limited receptive field, while ViT-based methods, despite having a global receptive field, face challenges when processing long sequences. Inspired by the Mamba network, which is based on a state space model (SSM) to efficiently capture global semantic information, we propose a semantic segmentation framework for high-resolution remotely sensed imagery, named Samba. Samba utilizes an encoder-decoder architecture, with multiple Samba blocks serving as the encoder to efficiently extract multi-level semantic information, and UperNet functioning as the decoder. We evaluate Samba on the LoveDA, ISPRS Vaihingen, and ISPRS Potsdam datasets using the mIoU and mF1 metrics, and compare it with top-performing CNN-based and ViT-based methods. The results demonstrate that Samba achieves unparalleled performance on commonly used remotely sensed datasets for semantic segmentation. Samba is the first to demonstrate the effectiveness of SSM in segmenting remotely sensed imagery, setting a new performance benchmark for Mamba-based techniques in this domain of semantic segmentation. The source code and baseline implementations are available at https://github.com/zhuqinfeng1999/Samba.
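As background for the SSM terminology used above, the discrete state space recurrence that Mamba-style blocks build on can be written in a few lines. The NumPy toy below uses fixed random parameters, whereas Mamba makes them input-dependent (selective) and uses a hardware-aware parallel scan.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_state = 16, 4
A = 0.9 * np.eye(d_state)              # state transition (kept stable by construction)
B = rng.normal(size=d_state)           # input projection
C = rng.normal(size=d_state)           # output projection
x = rng.normal(size=seq_len)           # a 1-D input sequence (e.g. flattened image tokens)

h = np.zeros(d_state)
y = np.empty(seq_len)
for t in range(seq_len):                # one pass, linear in sequence length
    h = A @ h + B * x[t]                # h_t = A h_{t-1} + B x_t
    y[t] = C @ h                        # y_t = C h_t
print(np.round(y[:4], 3))
```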

ResNet with Global and Local Image Features, Stacked Pooling Block, for Semantic Segmentation
  • Conference Article. Aug 1, 2018. Hui Song + 4 more. doi:10.1109/iccchina.2018.8641146. Cited 1 time.

Recently, deep convolutional neural networks (CNNs) have achieved great success in semantic segmentation systems. In this paper, we show how to improve pixel-wise semantic segmentation by combining global context information with local image features. First, we implement a fusion layer that allows us to merge global features and local features in the encoder network. Second, in the decoder network, we introduce a stacked pooling block, which significantly expands the receptive fields of feature maps and is essential to contextualize local semantic predictions. Furthermore, our approach is based on ResNet18, which gives our model far fewer parameters than currently published models. The whole framework is trained end-to-end without any post-processing. We show that our method improves semantic image segmentation on two datasets, CamVid and Cityscapes, demonstrating its effectiveness.
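A rough PyTorch sketch of what a stride-1 stacked pooling block could look like: repeated 3 × 3 max pooling keeps the spatial size while the receptive field grows at each stage, and the intermediate maps are summed. The number of stages and the kernel size are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class StackedPooling(nn.Module):
    def __init__(self, n_stages: int = 4):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1) for _ in range(n_stages)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, cur = x, x
        for pool in self.pools:
            cur = pool(cur)   # same spatial size, receptive field grows by 2 per stage
            out = out + cur
        return out

y = StackedPooling()(torch.randn(1, 64, 32, 32))
print(y.shape)  # torch.Size([1, 64, 32, 32])
```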

Fashion clothing matching by global-local feature optimization
  • Research Article. Journal of Image and Graphics, Jan 1, 2023. Yunzhu Wang + 4 more. doi:10.11834/jig.211170.

Objective: Fashion clothing matching is an active topic in clothing-related fashion research. It requires learning the complex matching relationships (i.e., fashion compatibility) among the different fashion items in an outfit. Fashion items have rich local designs and matching relationships among those local designs. Most existing approaches analyse global compatibility using only the items' global (visual and textual) features, while local feature extraction for local compatibility is often ignored, which limits the completeness of compatibility learning and lowers matching accuracy. We therefore develop a fashion clothing matching method based on global-local feature optimization that extracts local features of fashion images, models the local compatibility of fashion items, and improves matching accuracy by combining global and local compatibility. Method: First, two different convolutional neural networks (CNNs) extract the global features of fashion items from the input fashion images and texts. To extract CNN-based local features of the fashion images, a multi-branch local feature extraction network is designed; each branch consists of a convolution layer, a batch normalization (BN) layer and a ReLU activation, and different branches extract different local features of a fashion image. Second, a global-local compatibility learning module is constructed from a graph neural network (GNN) and a self-attention mechanism (SAM), which models both global and local compatibility: the GNN models interactions among global features and among local features separately, while SAM-based weights for the different fashion items are integrated into the modelling to obtain each item's global and local compatibility. Finally, a fashion clothing matching optimization model integrates the global and local compatibility of all items in an outfit, with trade-off parameters adjusting the influence of each on the matching result; a matching score is computed for every candidate scheme and the highest-scoring scheme is output. Result: The method is validated on the public Polyvore dataset, which contains fashion item images and textual descriptions. The local features extracted by the proposed network represent items' local information effectively without attribute-label supervision; the global-local compatibility learning module models global and local compatibility jointly while accounting for item weights; and the fill-in-the-blank (FITB) accuracy of fashion style matching improves to 86.89%. Conclusion: The proposed global-local feature optimization method effectively improves the accuracy of fashion clothing matching and meets the needs of daily fashion matching. Its convergence speed is still slow, and the optimization model only combines the global and local compatibility of an outfit linearly, whereas in practice their relationship is more complex; future work will address convergence speed and further optimize clothing matching.

Semantic Segmentation of High-Resolution Remote Sensing Images Based on Sparse Self-Attention and Feature Alignment
  • Research Article. Remote Sensing, Mar 15, 2023. Li Sun + 6 more. doi:10.3390/rs15061598. Cited 6 times.

Semantic segmentation of high-resolution remote sensing images (HRSI) is significant yet challenging. Recently, several research works have utilized the self-attention operation to capture global dependencies. HRSI have complex scenes and rich details, and applying self-attention to a whole image introduces redundant information that interferes with semantic segmentation. Detail recovery in HRSI is another challenging aspect of semantic segmentation. Several networks use up-sampling, skip connections, parallel structures, and enhanced edge features to obtain more precise results. However, these methods ignore the misalignment of features with different resolutions, which affects the accuracy of the segmentation results. To resolve these problems, this paper proposes a semantic segmentation network based on sparse self-attention and feature alignment (SAANet). Specifically, the sparse position self-attention module (SPAM) divides, rearranges, and restores the feature maps in the position dimension and performs position attention operations (PAM) in the rearranged and restored sub-regions, respectively. Meanwhile, the proposed sparse channel self-attention module (SCAM) groups, rearranges, and restores the feature maps in the channel dimension and performs channel attention operations (CAM) in the rearranged and restored sub-channels, respectively. SPAM and SCAM effectively model long-range context information and interdependencies between channels while reducing the introduction of redundant information. Finally, the feature alignment module (FAM) utilizes convolutions to obtain a learnable offset map and aligns feature maps with different resolutions, helping to recover details and refine feature representations. Extensive experiments conducted on the ISPRS Vaihingen, Potsdam, and LoveDA datasets demonstrate that the proposed method outperforms general semantic segmentation networks and self-attention-based networks.
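A rough PyTorch sketch of the "sparse" position-attention idea described above: the feature map is split into sub-regions and scaled dot-product attention runs inside each region only, rather than over the whole image. The region size, the absence of learned query/key/value projections, and the tensor layout are simplifying assumptions.

```python
import torch

def region_attention(feat: torch.Tensor, region: int = 8) -> torch.Tensor:
    b, c, h, w = feat.shape
    assert h % region == 0 and w % region == 0
    # (b, c, h, w) -> (b * n_regions, region*region, c)
    x = feat.unfold(2, region, region).unfold(3, region, region)    # b, c, nh, nw, r, r
    x = x.permute(0, 2, 3, 4, 5, 1).reshape(-1, region * region, c)
    attn = torch.softmax(x @ x.transpose(1, 2) * c ** -0.5, dim=-1)  # attention per region only
    out = attn @ x
    # fold the regions back into a (b, c, h, w) map
    nh, nw = h // region, w // region
    out = out.reshape(b, nh, nw, region, region, c).permute(0, 5, 1, 3, 2, 4)
    return out.reshape(b, c, h, w)

y = region_attention(torch.randn(2, 32, 64, 64))
print(y.shape)  # torch.Size([2, 32, 64, 64])
```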

DOLG: Single-Stage Image Retrieval with Deep Orthogonal Fusion of Local and Global Features
  • Conference Article. Oct 1, 2021. Min Yang + 7 more. doi:10.1109/iccv48922.2021.01156. Cited 103 times.

Image Retrieval is a fundamental task of obtaining images similar to the query one from a database. A common image retrieval practice is to firstly retrieve candidate images via similarity search using global image features and then re-rank the candidates by leveraging their local features. Previous learning-based studies mainly focus on either global or local image representation learning to tackle the retrieval task. In this paper, we abandon the two-stage paradigm and seek to design an effective single-stage solution by integrating local and global information inside images into compact image representations. Specifically, we propose a Deep Orthogonal Local and Global (DOLG) information fusion framework for end-to-end image retrieval. It attentively extracts representative local information with multi-atrous convolutions and self-attention at first. Components orthogonal to the global image representation are then extracted from the local information. At last, the orthogonal components are concatenated with the global representation as a complementary, and then aggregation is performed to generate the final representation. The whole framework is end-to-end differentiable and can be trained with image-level labels. Extensive experimental results validate the effectiveness of our solution and show that our model achieves state-of-the-art image retrieval performance on the Revisited Oxford and Paris datasets.
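The orthogonal-fusion step lends itself to a compact sketch: each local descriptor is decomposed against the global descriptor and only the component orthogonal to it is kept before concatenation. The PyTorch snippet below assumes simple shapes and omits the multi-atrous/self-attention extraction and the final aggregation step.

```python
import torch

def orthogonal_fusion(local_feat: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
    # local_feat: (b, c, h, w), global_feat: (b, c)
    g = global_feat[:, :, None, None]                                # (b, c, 1, 1)
    proj = (local_feat * g).sum(dim=1, keepdim=True) / (g.pow(2).sum(dim=1, keepdim=True) + 1e-6)
    orth = local_feat - proj * g                                     # component orthogonal to g
    g_map = g.expand(-1, -1, *local_feat.shape[2:])                  # broadcast global over space
    return torch.cat([orth, g_map], dim=1)                           # (b, 2c, h, w)

fused = orthogonal_fusion(torch.randn(2, 64, 7, 7), torch.randn(2, 64))
print(fused.shape)  # torch.Size([2, 128, 7, 7])
```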

Frequency-aware robust multidimensional information fusion framework for remote sensing image segmentation
  • Research Article. Engineering Applications of Artificial Intelligence, Dec 5, 2023. Junyu Fan + 3 more. doi:10.1016/j.engappai.2023.107638. Cited 11 times.

Biomedical Entity Linking Based on Global and Local Feature Fusion
  • Conference Article. Oct 27, 2022. Turdi Tohti + 2 more. doi:10.1109/ialp57159.2022.9961242.

Most existing biomedical entity linking methods consider only local or global features during feature extraction. To address this problem, this paper proposes a biomedical entity linking method that fuses local and global features. Firstly, candidate entities are generated by an alignment cosine-similarity method. Secondly, for each entity, a triplet of positive and negative entities is introduced as the input sample; the global semantic relevance representation is then obtained with the BioBERT pre-trained model, and the local features are subsequently extracted by a residual convolutional neural network. Finally, the global and local features are concatenated and trained using a triplet loss function. The experimental results show that the method outperforms the baseline methods with an accuracy of 91.56% and 93.73% on the NCBI and ADR biomedical entity linking benchmark datasets, respectively. The proposed biomedical entity linking model effectively alleviates the problem that traditional models ignore global or local information.
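A small PyTorch sketch of the triplet objective described above, with random placeholder embeddings standing in for the concatenated BioBERT and residual-CNN features; the feature dimensions and batch size are assumptions.

```python
import torch
import torch.nn as nn

global_dim, local_dim = 768, 128   # hypothetical global / local feature sizes

def embed(batch: int) -> torch.Tensor:
    # placeholder: concatenate "global" (BioBERT-like) and "local" (CNN-like) vectors
    return torch.cat([torch.randn(batch, global_dim), torch.randn(batch, local_dim)], dim=1)

anchor, positive, negative = embed(8), embed(8), embed(8)   # mention, correct entity, wrong entity
loss = nn.TripletMarginLoss(margin=1.0)(anchor, positive, negative)
print(loss.item())   # training pushes positives closer to the anchor than negatives
```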

More from: International Journal of Remote Sensing
  • Impact of land use and land cover changes on sensible heat variability in a fragment of the Atlantic Forest biome. Gabriela Gomes + 6 more. Nov 7, 2025. doi:10.1080/01431161.2025.2580584.
  • Enhancing tree species composition mapping using Sentinel-2 and multi-seasonal deep learning fusion. Yuwei Cao + 3 more. Nov 7, 2025. doi:10.1080/01431161.2025.2583600.
  • Wind-aware UAV photogrammetry planning: minimising motion blur for effective terrain surveying. Enrique Aldao + 7 more. Nov 6, 2025. doi:10.1080/01431161.2025.2579800.
  • A land surface effective temperature calculation method to improve microwave emissivity retrieval over barren areas. Xueying Wang + 3 more. Nov 3, 2025. doi:10.1080/01431161.2025.2581401.
  • Dual stage adversarial domain adaptation for multi-model Hyperspectral image classification. Wen Xie + 2 more. Nov 1, 2025. doi:10.1080/01431161.2025.2580779.
  • RMRN-DETR: regression-optimized remote sensing image detection network based on multi-dimensional real-time detection and domain adaptation. Muzi Chen + 9 more. Oct 31, 2025. doi:10.1080/01431161.2025.2564908.
  • Within-field crop leaf area index simulation using a hybrid PROSAIL-SVR approach: evaluating Sentinel-2 and PlanetScope potential. Rahul Raj + 5 more. Oct 30, 2025. doi:10.1080/01431161.2025.2571231.
  • Managing methane concentrations in western Canada: climate actions towards a net-zero target. Amir Ghahremanlou + 1 more. Oct 30, 2025. doi:10.1080/01431161.2025.2580780.
  • RAGCap: retrieval-augmented generation for style-aware remote sensing image captioning without fine-tuning. Yakoub Bazi + 2 more. Oct 30, 2025. doi:10.1080/01431161.2025.2575514.
  • Multimodal fusion network with learnable wavelet-enhanced features for hyperspectral unmixing. Zhixiang Wang + 4 more. Oct 30, 2025. doi:10.1080/01431161.2025.2579807.
