Spatial Plaid Attention Decoder for Semantic Segmentation

  • Abstract
  • Literature Map
  • Similar Papers
Abstract

Striking a balance between efficiency and accuracy is a central challenge in decoder design. Accurate decoders tend to be highly complex and computationally costly. This paper presents a novel decoder for semantic segmentation: the Spatial Plaid Attention Decoder (SPADe). We propose a Spatial Plaid Attention module that performs efficient local feature collection through spatial feature folding, helping the model capture the local structure of objects, while using long-range feature aggregation to consider global structures efficiently and accurately. This makes SPADe a suitable choice for applications with limited resources. At only 7.3% of the size of the UPerNet decoder, SPADe obtains state-of-the-art performance with several popular backbones on public benchmarks. On Cityscapes and ADE20K, SPADe obtains 84.3% and 53.8% mIoU while reducing total GFlops by 32.8% and 70.5%, respectively. We also demonstrate that the effective design of SPADe allows it to capture long-range dependencies with a large receptive field. An implementation of SPADe is available at github/SPADe.
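The abstract's pairing of local feature folding with long-range aggregation can be illustrated with a minimal NumPy sketch. The paper does not specify the folding operation; the space-to-depth `spatial_fold` and the mean-pooled `global_aggregate` below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def spatial_fold(x, k=2):
    """Fold k x k spatial neighborhoods into the channel dimension
    (space-to-depth), so local structure becomes channel-wise context.
    x: (C, H, W) with H, W divisible by k."""
    C, H, W = x.shape
    x = x.reshape(C, H // k, k, W // k, k)
    x = x.transpose(0, 2, 4, 1, 3)            # (C, k, k, H/k, W/k)
    return x.reshape(C * k * k, H // k, W // k)

def global_aggregate(x):
    """Long-range aggregation: add a global average descriptor to every
    position (a cheap stand-in for full self-attention)."""
    g = x.mean(axis=(1, 2), keepdims=True)    # (C, 1, 1) global context
    return x + g                              # broadcast global context

feat = np.random.rand(8, 4, 4)
folded = spatial_fold(feat, k=2)
print(folded.shape)   # (32, 2, 2)
```

A real decoder would interleave learned projections between these two steps; the sketch only shows how folding trades spatial extent for channel depth while the global branch keeps the receptive field large.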

Similar Papers
  • Conference Article
  • Citations: 4
  • 10.1109/iv51971.2022.9827377
Fusion Attention Network for Autonomous Cars Semantic Segmentation
  • Jun 5, 2022
  • Chuyao Wang + 1 more

Semantic segmentation is vital for autonomous car scene understanding. It provides more precise subject information than raw RGB images, which in turn boosts the performance of autonomous driving. Recently, self-attention methods have shown great improvement in image semantic segmentation: attention maps help scene parsing by capturing rich relationships between every pixel in an image. However, self-attention is computationally demanding. Moreover, existing works focus either on channel attention, ignoring pixel position, or on spatial attention, disregarding the impact of channels on each other. To address these problems, we present Fusion Attention Network, based on the self-attention mechanism, to harvest rich contextual dependencies. This model consists of two chains: pyramid fusion spatial attention and fusion channel attention. We apply pyramid sampling in the spatial attention module to reduce the computation of spatial attention maps; channel attention has a similar structure to the spatial attention. We also introduce a fusion technique to calculate contextual dependencies using features from both attention chains, and concatenate the results from the spatial and channel attention modules into an enhanced attention map, leading to better semantic segmentation results. We conduct extensive experiments on popular datasets with different settings, in addition to an ablation study, to prove the efficiency of our approach. Our model achieves better results on Cityscapes [7] than state-of-the-art methods and also shows good generalization capability on PASCAL VOC 2012 [9].

  • Research Article
  • Citations: 10
  • 10.1016/j.bspc.2024.106163
Spider-Net: High-resolution multi-scale attention network with full-attention decoder for tumor segmentation in kidney, liver and pancreas
  • Feb 28, 2024
  • Biomedical Signal Processing and Control
  • Yanjun Peng + 5 more


  • Research Article
  • Citations: 2
  • 10.1016/j.eswa.2021.116438
Pixel Voting Decoder: A novel decoder that regresses pixel relationships for segmentation
  • Dec 31, 2021
  • Expert Systems With Applications
  • Pengfei Xian + 8 more


  • Conference Article
  • Citations: 1
  • 10.1145/3373509.3373535
Accurate Semantic Segmentation in Remote Sensing Image
  • Oct 23, 2019
  • Shuqi Wang + 2 more

Thanks to recent developments in CNNs and deep learning, solid improvements have been made in semantic segmentation. However, most previous work targets automated driving and does not fully take into account the specific difficulties of high-resolution remote sensing imagery, where objects are small and crowded and exhibit large intra-class scale differences. To tackle this challenging task, we propose a novel encoder-decoder architecture with a multi-scale dilated convolution, spatial attention, and separable convolution module (Global Attention Pyramid) and a channel attention decoder (Attention Decoder). The Global Attention Pyramid module addresses these problems by enlarging the receptive field with pixel-level attention, without reducing the resolution of the feature maps, while the Attention Decoder module provides global context to select category localization details. We tested our network on two satellite imagery datasets and obtained remarkably good results on both, especially for small objects, improving performance from 0.6341 to 0.6510 on the DEEPGLOBE road extraction dataset.

  • Conference Article
  • Citations: 10
  • 10.5220/0007366003930400
Design of Real-time Semantic Segmentation Decoder for Automated Driving
  • Jan 1, 2019
  • Arindam Das + 3 more

Semantic segmentation remains a computationally intensive algorithm for embedded deployment, even with the rapid growth of computational power. Efficient network design is therefore critical, especially for applications like automated driving that require real-time performance. Recently, there has been a lot of research on designing efficient encoders, which are mostly task agnostic. Unlike image classification and bounding-box object detection, however, the decoder is also computationally expensive for the semantic segmentation task. In this work, we focus on efficient design of the segmentation decoder and assume that an efficient encoder is already available to provide shared features for a multi-task learning system. We design a novel efficient non-bottleneck layer and a family of decoders that fit into a small run-time budget, using VGG10 as the efficient encoder. We demonstrate on our dataset that experimentation with various design choices led to a 10% improvement over baseline performance.

  • Research Article
  • Citations: 47
  • 10.1609/aaai.v37i2.25321
FeedFormer: Revisiting Transformer Decoder for Efficient Semantic Segmentation
  • Jun 26, 2023
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Jae-Hun Shim + 3 more

With the success of Vision Transformer (ViT) in image classification, its variants have yielded great success in many downstream vision tasks, and semantic segmentation has also benefited greatly from their advances. However, most studies of transformers for semantic segmentation focus on designing efficient transformer encoders and rarely give attention to the decoder. Several studies have attempted to use the transformer decoder as the segmentation decoder with class-wise learnable queries; instead, we aim to use the encoder features directly as the queries. This paper proposes the Feature Enhancing Decoder transFormer (FeedFormer), which enhances structural information using the transformer decoder. Our goal is to decode the high-level encoder features using the lowest-level encoder feature. We do this by formulating the high-level features as queries and the lowest-level feature as the key and value, enhancing the high-level features by collecting structural information from the lowest-level feature. Additionally, we use a simple reformation trick, pushing encoder blocks into the place of the decoder's existing self-attention module, to improve efficiency. We show the superiority of our decoder against various lightweight transformer-based decoders on popular semantic segmentation datasets. Despite its minimal computation, our model achieves state-of-the-art performance in the performance-computation trade-off. FeedFormer-B0 surpasses SegFormer-B0 with 1.8% higher mIoU and 7.1% less computation on ADE20K, and 1.7% higher mIoU and 14.4% less computation on Cityscapes. Code will be released at: https://github.com/jhshim1995/FeedFormer.
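FeedFormer's core move, high-level features as queries against the lowest-level feature as key and value, is ordinary cross-attention. The single-head NumPy sketch below illustrates the idea; the shapes and names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(high, low):
    """FeedFormer-style decoding: high-level tokens are the queries;
    lowest-level tokens serve as both keys and values.
    high: (Nq, C) coarse semantic tokens, low: (Nk, C) fine structural tokens."""
    d = high.shape[-1]
    scores = high @ low.T / np.sqrt(d)        # (Nq, Nk) affinities
    return softmax(scores, axis=-1) @ low     # structure-enhanced queries

high = np.random.rand(16, 64)    # e.g. a downsampled, semantic stage
low = np.random.rand(256, 64)    # e.g. the highest-resolution stage
out = cross_attention(high, low)
print(out.shape)   # (16, 64)
```

Each output row is a convex combination of low-level tokens, which is how the coarse features "collect" structural detail without learnable class queries (learned projections for Q/K/V are omitted here for brevity).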

  • Conference Article
  • Citations: 2
  • 10.1109/imcom51814.2021.9377401
TSS-Net: Time-based Semantic Segmentation Neural Network for Road Scene Understanding
  • Jan 4, 2021
  • Tin Trung Duong + 2 more

In this research, we propose a multitask convolutional neural network that performs end-to-end road scene classification and semantic segmentation, two crucial tasks for advanced driver assistance systems (ADAS). We name the network TSS, for time-based semantic segmentation. The network contains three main modules: an image encoder, a scene classifier, and two time-based segmentation decoders. For each road scene image, the encoder extracts image features used by both the classifier and the decoders. The image features are first fed to the classifier to predict the scene type (in this case, a day or a night scene). Then, based on the predicted scene type, the same features are fed to the corresponding segmentation decoder to produce the final semantic segmentation result. This classification-driven decoder approach improves the accuracy of the segmentation model, even when the model has already been trained extensively. Our experiments demonstrate the validity of the proposed method. The approach can be viewed as stacking multiple segmentation decoders on top of the classification module, all sharing the same image encoder, so that the classification result improves segmentation accuracy in a single forward pass.

  • Conference Article
  • Citations: 3
  • 10.24963/ijcai.2023/74
Decoupling with Entropy-based Equalization for Semi-Supervised Semantic Segmentation
  • Aug 1, 2023
  • Chuanghao Ding + 6 more

Semi-supervised semantic segmentation methods are the main solution to the high annotation cost of semantic segmentation. However, class imbalance makes models favor head classes with sufficient training samples, resulting in poor performance on tail classes. To address this issue, we propose a Decoupled Semi-Supervised Semantic Segmentation (DeS4) framework based on the teacher-student model. Specifically, we first propose a decoupled training strategy that splits the training of the encoder and the segmentation decoder, aiming at a balanced decoder. Then, a non-learnable prototype-based segmentation head is proposed to regularize the consistency of category representation distributions and establish a better connection between the teacher and student models. Furthermore, a Multi-Entropy Sampling (MES) strategy is proposed to collect pixel representations for updating the shared prototype, yielding a class-unbiased head. We conduct extensive experiments with DeS4 on two challenging benchmarks (PASCAL VOC 2012 and Cityscapes) and achieve remarkable improvements over previous state-of-the-art methods.

  • Research Article
  • Citations: 35
  • 10.1609/aaai.v36i1.19985
Channelized Axial Attention – considering Channel Relation within Spatial Attention for Semantic Segmentation
  • Jun 28, 2022
  • Proceedings of the AAAI Conference on Artificial Intelligence
  • Ye Huang + 4 more

Spatial and channel attentions, modelling the semantic interdependencies in spatial and channel dimensions respectively, have recently been widely used for semantic segmentation. However, computing spatial and channel attentions separately sometimes causes errors, especially for those difficult cases. In this paper, we propose Channelized Axial Attention (CAA) to seamlessly integrate channel attention and spatial attention into a single operation with negligible computation overhead. Specifically, we break down the dot-product operation of the spatial attention into two parts and insert channel relation in between, allowing for independently optimized channel attention on each spatial location. We further develop grouped vectorization, which allows our model to run with very little memory consumption without slowing down the running speed. Comparative experiments conducted on multiple benchmark datasets, including Cityscapes, PASCAL Context, and COCO-Stuff, demonstrate that our CAA outperforms many state-of-the-art segmentation models (including dual attention) on all tested datasets.

  • Research Article
  • 10.3390/app15042012
MCDCNet: Mask Classification Combined with Adaptive Dilated Convolution for Image Semantic Segmentation
  • Feb 14, 2025
  • Applied Sciences
  • Geng Wei + 5 more

Effectively classifying each pixel in an image is an important research topic in semantic segmentation. Existing methods typically require the network to directly generate a feature map of the same size as the original image and classify each pixel, which makes it difficult for the network to fully leverage the representations from the backbone. To handle this challenge, this paper proposes mask classification combined with an adaptive dilated convolution network (MCDCNet). Firstly, a Vision Transformer (ViT)-based module is employed as the backbone to capture contextual features. Secondly, a Spatial Extraction Module (SEM) is proposed to extract multi-scale spatial information through adaptive dilated convolution while preserving the original feature size; this spatial information is then integrated into the corresponding contextual features to enhance the representation. Finally, a novel inference process incorporating an instance activation map (IAM)-based decoder is proposed for semantic segmentation, enhancing the network's capability to capture and comprehend semantic features. The experimental results demonstrate that our network significantly outperforms other per-pixel classification networks across several semantic segmentation datasets. In particular, on Cityscapes, MCDCNet achieves 80.3 mIoU with 11.8 M parameters, demonstrating that the network delivers strong segmentation performance while maintaining a relatively low parameter count.

  • Conference Article
  • Citations: 5
  • 10.1109/ieeeconf44664.2019.9048981
MEDA: Multi-output Encoder-Decoder for Spatial Attention in Convolutional Neural Networks
  • Nov 1, 2019
  • Huayu Li + 1 more

Utilizing channel-wise and spatial attention mechanisms to emphasize salient parts of an input image is an effective way to improve the performance of convolutional neural networks (CNNs). There are multiple effective implementations of attention mechanisms. One adds squeeze-and-excitation (SE) blocks to the CNN, selectively emphasizing the most informative channels and suppressing the less informative ones by exploiting channel dependence. Another adds a convolutional block attention module (CBAM), implementing both channel-wise and spatial attention to select important pixels in the feature maps while emphasizing informative channels. In this paper, we propose an encoder-decoder architecture based on the idea of letting the channel-wise and spatial attention blocks share the same latent-space representation. Instead of separating the channel-wise and spatial attention modules into two independent parts as in CBAM, we combine them into one encoder-decoder architecture with two outputs. To evaluate the proposed algorithm, we apply it to different CNN architectures and test it on image classification and semantic segmentation. Comparing structures equipped with MEDA blocks against other attention modules, we show that the proposed method achieves better performance across different test scenarios.
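MEDA's central idea, one shared latent code that emits both a channel-attention output and a spatial-attention output, can be sketched in NumPy. The squeeze step, latent size, and weight shapes below are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def meda_block(x, w_enc, w_ch, w_sp):
    """Shared-latent dual attention: encode once, decode twice.
    x: (C, H, W); w_enc: (D, C); w_ch: (C, D); w_sp: (H*W, D)."""
    C, H, W = x.shape
    pooled = x.mean(axis=(1, 2))              # (C,) squeeze step
    z = np.tanh(w_enc @ pooled)               # shared latent representation
    ch = sigmoid(w_ch @ z)                    # (C,) channel attention weights
    sp = sigmoid((w_sp @ z).reshape(H, W))    # (H, W) spatial attention map
    return x * ch[:, None, None] * sp[None]   # apply both attentions

C, H, W, D = 8, 4, 4, 16
x = np.random.rand(C, H, W)
out = meda_block(x, np.random.rand(D, C), np.random.rand(C, D),
                 np.random.rand(H * W, D))
print(out.shape)   # (8, 4, 4)
```

The design point the sketch isolates is that, unlike CBAM's two independent branches, both attention maps here are decoded from the same `z`, so the channel and spatial selections are forced to agree on one latent description of the input.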

  • Research Article
  • Citations: 6
  • 10.1007/s11063-025-11748-8
SFA: Efficient Attention Mechanism for Superior CNN Performance
  • Apr 4, 2025
  • Neural Processing Letters
  • Wei Xu + 2 more

Attention mechanisms are critical tools for enhancing the performance of convolutional neural networks (CNNs), focusing on spatial and channel dimensions of feature maps, known as spatial attention and channel attention, respectively. While many advanced attention methods combine these dimensions to improve performance, particularly in downstream computer vision tasks, such methods often introduce significant computational overhead or fail to effectively capture long-range spatial dependencies alongside channel attention. To address these challenges, this paper proposes the sequential fusion attention (SFA) method, which introduces a complementary fusion strategy to integrate spatial and channel attention. Spatial attention leverages strip pooling to model long-range dependencies, while channel attention employs dynamic encoding to refine features. By utilizing a grouped processing approach, the SFA module achieves an optimal balance between computational efficiency and representation power. Extensive experiments on benchmark datasets demonstrate that SFA consistently outperforms state-of-the-art attention mechanisms, delivering competitive accuracy in image classification, object detection, and semantic segmentation tasks while maintaining reduced model complexity. This work underscores the potential of lightweight attention mechanisms in modern computer vision and paves the way for further innovations in resource-efficient neural network design. Our code is publicly available at the following URL: https://github.com/Xuwei86/SFA

  • Research Article
  • Citations: 40
  • 10.1109/lsp.2021.3084855
Two-Stage Cascaded Decoder for Semantic Segmentation of RGB-D Images
  • Jan 1, 2021
  • IEEE Signal Processing Letters
  • Yuchun Yue + 3 more

Exploiting RGB and depth information can boost the performance of semantic segmentation. However, owing to the differences between RGB images and the corresponding depth maps, such multimodal information should be effectively used and combined. Most existing methods use the same fusion strategy to explore multilevel complementary information at various levels, likely ignoring different feature contributions at various levels for segmentation. To address this problem, we propose a network using a two-stage cascaded decoder (TCD), embedding a detail polishing module, to effectively integrate high- and low-level features and suppress noise from low-level details. Additionally, we introduce a depth filter and fusion module to extract informative regions from depth cues with the guidance of RGB images. The proposed TCD network achieves comparable performance to state-of-the-art RGB-D semantic segmentation methods on the benchmark NYUDv2 and SUN RGB-D datasets.

  • Conference Article
  • 10.1109/icpeca56706.2023.10075770
Research on Semantic Segmentation of Airborne LiDAR Point Cloud Based on Spatial Position Attention Mechanism
  • Jan 29, 2023
  • Zeyu Tian + 2 more

Airborne LiDAR detection technology can observe the earth quickly and efficiently, actively obtaining three-dimensional information about large areas of ground objects in real time and generating large-scale LiDAR point clouds. However, the complex categories and uneven heights of ground objects make semantic segmentation difficult. To overcome this difficulty, this paper proposes a new semantic segmentation network for point clouds based on a spatial position attention mechanism and a deep learning network. The network processes the original point cloud directly, without converting it into groups of 2D feature images or 3D voxel grids, thereby avoiding information loss. The segmentation model uses an encoder and decoder to extract multi-scale features and a multilayer perceptron to achieve high-precision segmentation. Through the spatial position attention mechanism, the network can strengthen or weaken the weights of the convolution kernel to automatically adapt to the spatial structure of point cloud objects. Experiments on the ISPRS dataset, provided by the International Society for Photogrammetry and Remote Sensing, indicate that the proposed network can effectively identify various ground objects and achieves higher semantic segmentation accuracy than other popular methods.

  • Book Chapter
  • Citations: 5
  • 10.1007/978-3-031-26293-7_25
Causal-SETR: A SEgmentation TRansformer Variant Based on Causal Intervention
  • Jan 1, 2023
  • Wei Li + 1 more

We present a novel SEgmentation TRansformer variant based on causal intervention. It serves as an improved vision encoder for semantic segmentation. Many studies have shown that vision transformers (ViT) achieve competitive benchmarks on downstream tasks, demonstrating that they learn feature representations well; in other words, they are good at observing instances in an image. In the human visual system, however, recognizing objects in a scene requires both observing the objects themselves and introducing prior knowledge to produce higher-confidence results. Inspired by this, we introduce a structural causal model (SCM) over images, category labels, and context. Beyond observation, we propose a causal intervention method that removes the confounding bias of the global context and plugs into the ViT encoder. Unlike other sequence-to-sequence prediction tasks, we use causal intervention instead of likelihood. The proxy training objective of the framework is to predict the contextual objects of a region. Finally, we combine this encoder with a segmentation decoder. Experiments show that our proposed method is flexible and effective.
