Spatial Plaid Attention Decoder for Semantic Segmentation
Striking a balance between efficiency and accuracy is a challenge in decoder design: accurate decoders tend to be highly complex and computationally costly. This paper presents a novel decoder for semantic segmentation, the Spatial Plaid Attention Decoder (SPADe). We propose a Spatial Plaid Attention module that performs efficient local feature collection through spatial feature folding, helping the model capture the local structure of objects, while using long-range feature aggregation to capture global structure efficiently and accurately. This makes SPADe a suitable choice for resource-limited applications. At only 7.3% of the size of the UPerNet decoder, SPADe obtains state-of-the-art performance with several popular backbones on public benchmarks: on Cityscapes and ADE20K it reaches 84.3% and 53.8% mIoU while reducing total GFlops by 32.8% and 70.5%, respectively. We also show that the design of SPADe allows it to capture long-range dependencies with a large receptive field. The implementation of SPADe is available at github/SPADe.
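The abstract does not define "spatial feature folding" precisely; a minimal sketch of one plausible reading is below, assuming folding means a space-to-depth rearrangement (here via `pixel_unshuffle`) so that attention over the folded grid mixes each 2x2 local neighborhood while operating on a quarter as many tokens. The module name, fold factor, and head count are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FoldedAttention(nn.Module):
    """Hypothetical fold-then-attend block: local structure enters through the
    2x2 folding, global structure through attention over the folded tokens."""
    def __init__(self, channels: int, fold: int = 2, heads: int = 4):
        super().__init__()
        self.fold = fold
        self.attn = nn.MultiheadAttention(channels * fold * fold, heads,
                                          batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        folded = F.pixel_unshuffle(x, self.fold)       # (b, c*f*f, h/f, w/f)
        tokens = folded.flatten(2).transpose(1, 2)     # (b, hw/f^2, c*f*f)
        out, _ = self.attn(tokens, tokens, tokens)     # long-range aggregation
        out = out.transpose(1, 2).reshape_as(folded)
        return F.pixel_shuffle(out, self.fold) + x     # unfold, keep residual

x = torch.randn(1, 64, 32, 32)
print(FoldedAttention(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```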
- Conference Article
- 10.1109/iv51971.2022.9827377
- Jun 5, 2022
Semantic segmentation is vital for scene understanding in autonomous cars: it provides more precise subject information than raw RGB images, which in turn boosts the performance of autonomous driving. Recently, self-attention methods have shown great improvements in image semantic segmentation, since attention maps supply scene parsing with rich relationships between every pair of pixels in an image. However, self-attention is computationally demanding. Moreover, existing works focus either on channel attention, ignoring pixel position, or on spatial attention, disregarding the influence of channels on each other. To address these problems, we present a Fusion Attention Network based on the self-attention mechanism to harvest rich contextual dependencies. The model consists of two chains: pyramid fusion spatial attention and fusion channel attention. We apply pyramid sampling in the spatial attention module to reduce the computation of spatial attention maps; the channel attention module has a similar structure. We also introduce a fusion technique that calculates contextual dependencies using features from both attention chains, and we concatenate the results from the spatial and channel attention modules into an enhanced attention map, leading to better semantic segmentation results. We conduct extensive experiments on popular datasets under different settings, in addition to an ablation study, to prove the efficiency of our approach. Our model achieves better results on Cityscapes [7] than state-of-the-art methods and also shows good generalization capability on PASCAL VOC 2012 [9].
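A loose sketch of pyramid-sampled spatial attention of the kind described: keys and values are pooled to a few small grids and concatenated, so the attention map is N x 110 rather than N x N. The pool sizes and layer names are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidSpatialAttention(nn.Module):
    def __init__(self, channels: int, sizes=(1, 3, 6, 8)):
        super().__init__()
        self.sizes = sizes
        self.q = nn.Conv2d(channels, channels, 1)
        self.kv = nn.Conv2d(channels, channels * 2, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)          # (b, hw, c)
        kv = self.kv(x)
        # pyramid sampling: pool k/v to small grids, concatenate the tokens
        pooled = torch.cat(
            [F.adaptive_avg_pool2d(kv, s).flatten(2) for s in self.sizes], dim=2)
        k, v = pooled.chunk(2, dim=1)                     # each (b, c, 110)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)    # (b, hw, 110), not hw x hw
        out = (attn @ v.transpose(1, 2)).transpose(1, 2).reshape(b, c, h, w)
        return out + x

print(PyramidSpatialAttention(64)(torch.randn(1, 64, 32, 32)).shape)
```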
- Research Article
- 10.1016/j.bspc.2024.106163
- Feb 28, 2024
- Biomedical Signal Processing and Control
Spider-Net: High-resolution multi-scale attention network with full-attention decoder for tumor segmentation in kidney, liver and pancreas
- Research Article
- 10.1016/j.eswa.2021.116438
- Dec 31, 2021
- Expert Systems With Applications
Pixel Voting Decoder: A novel decoder that regresses pixel relationships for segmentation
- Conference Article
- 10.1145/3373509.3373535
- Oct 23, 2019
Thanks to recent developments in CNNs and deep learning, solid improvements have been made in semantic segmentation. However, most previous work targets autonomous driving and does not fully take into account the specific difficulties of high-resolution remote sensing imagery, where objects are small and crowded and intra-class scale differences are large. To tackle this challenging task, we propose a novel encoder-decoder architecture with a multi-scale dilated convolution module combining spatial attention and separable convolution (Global Attention Pyramid) and a channel attention decoder (Attention Decoder). The Global Attention Pyramid module addresses these problems by enlarging the receptive field, without reducing the resolution of the feature maps, and by pixel-level attention; the Attention Decoder module provides global context for selecting category localization details. We tested our network on two satellite imagery datasets and obtained remarkably good results on both, especially for small objects. Our network improves performance from 0.6341 to 0.6510 on the DeepGlobe road extraction dataset.
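An illustrative sketch of the two ideas the abstract names: parallel dilated convolutions enlarge the receptive field without shrinking the feature map, and a pixel-level attention map reweights the merged result. The dilation rates and structure are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class DilatedPyramidAttention(nn.Module):
    def __init__(self, channels: int, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in rates)
        self.merge = nn.Conv2d(channels * len(rates), channels, 1)
        self.gate = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x):
        # same-resolution multi-scale context: every branch keeps H x W
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        merged = self.merge(feats)
        return merged * self.gate(merged) + x   # pixel-level attention + residual
```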
- Conference Article
- 10.5220/0007366003930400
- Jan 1, 2019
Semantic segmentation remains a computationally intensive algorithm for embedded deployment, even with the rapid growth of computation power. Efficient network design is therefore critical, especially for applications like automated driving that require real-time performance. Recently, there has been much research on designing efficient encoders, which are mostly task agnostic. Unlike in image classification and bounding-box object detection, however, decoders are computationally expensive for the semantic segmentation task. In this work, we focus on the efficient design of the segmentation decoder and assume that an efficient encoder has already been designed to provide shared features for a multi-task learning system. We design a novel efficient non-bottleneck layer and a family of decoders that fit into a small run-time budget, using VGG10 as the efficient encoder. On our dataset, experimentation with various design choices led to a 10% improvement over the baseline.
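The abstract does not detail the proposed non-bottleneck layer; the sketch below shows the standard trick such layers typically build on (as in ERFNet's non-bottleneck-1D): factorizing a 3x3 convolution into 3x1 and 1x3 convolutions to cut parameters and run time while keeping the channel width. Treat it as background, not the paper's layer.

```python
import torch.nn as nn

class FactorizedNonBottleneck(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, (3, 1), padding=(1, 0)),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, (1, 3), padding=(0, 1)),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.block(x) + x)  # residual keeps gradients healthy
```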
- Research Article
- 10.1609/aaai.v37i2.25321
- Jun 26, 2023
- Proceedings of the AAAI Conference on Artificial Intelligence
With the success of the Vision Transformer (ViT) in image classification, its variants have yielded great success in many downstream vision tasks, and semantic segmentation has benefited greatly from these advances. However, most transformer studies for semantic segmentation focus only on designing efficient transformer encoders, rarely giving attention to the decoder. Several studies have attempted to use the transformer decoder as the segmentation decoder with class-wise learnable queries. Instead, we aim to use the encoder features directly as the queries. This paper proposes the Feature Enhancing Decoder transFormer (FeedFormer), which enhances structural information using the transformer decoder. Our goal is to decode the high-level encoder features using the lowest-level encoder feature. We do this by formulating the high-level features as queries and the lowest-level feature as the key and value, which enhances the high-level features by collecting structural information from the lowest-level feature. Additionally, we use a simple reformation trick, replacing the decoder's existing self-attention module with encoder blocks to improve efficiency. We show the superiority of our decoder over various lightweight transformer-based decoders on popular semantic segmentation datasets. Despite its small computational cost, our model achieves a state-of-the-art performance-computation trade-off: FeedFormer-B0 surpasses SegFormer-B0 with 1.8% higher mIoU and 7.1% less computation on ADE20K, and 1.7% higher mIoU and 14.4% less computation on Cityscapes. Code will be released at: https://github.com/jhshim1995/FeedFormer.
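A minimal sketch of the decoding rule the abstract states: high-level encoder features act as queries and the lowest-level feature as key and value, so cross-attention copies structural detail into the coarse features. The dimensions and projection layout are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeatureEnhancingDecoder(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, high: torch.Tensor, low: torch.Tensor) -> torch.Tensor:
        # high: (b, n_high, dim) coarse tokens; low: (b, n_low, dim) fine tokens
        enhanced, _ = self.cross_attn(query=high, key=low, value=low)
        return self.norm(high + enhanced)

high = torch.randn(1, 64, 256)    # e.g. tokens from a 1/32-resolution stage
low = torch.randn(1, 4096, 256)   # tokens from the 1/4-resolution stage
print(FeatureEnhancingDecoder()(high, low).shape)  # torch.Size([1, 64, 256])
```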
- Conference Article
- 10.1109/imcom51814.2021.9377401
- Jan 4, 2021
In this research, we propose a multitask convolutional neural network for end-to-end road scene classification and semantic segmentation, two crucial tasks for advanced driver assistance systems (ADAS). We name the network TSS, for time-based semantic segmentation. The network contains three main modules: an image encoder, a scene classifier, and two time-based segmentation decoders. For each road scene image, the encoder extracts image features used by both the classifier and the decoders. The image features are first fed to the classifier to predict the scene type (in this case, a day or a night scene); then, based on the predicted scene type, the same features are fed to the corresponding segmentation decoder to produce the final semantic segmentation result. With this classification-driven decoder approach, we can improve the accuracy of the segmentation model even when the model has already been trained extensively. The experiments confirm the validity of our proposed method. Our approach can be viewed as stacking multiple segmentation modules on top of the classification module, with all of them sharing the same image encoder; this lets us use the classification result to gain segmentation accuracy in a single forward pass.
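A sketch of the classification-driven routing the abstract describes: one shared encoder, a scene classifier, and per-scene decoders, with the predicted scene type (day/night) selecting which decoder produces the segmentation. All module shapes here are placeholder assumptions.

```python
import torch
import torch.nn as nn

class TSSLike(nn.Module):
    def __init__(self, encoder: nn.Module, classifier: nn.Module,
                 day_decoder: nn.Module, night_decoder: nn.Module):
        super().__init__()
        self.encoder, self.classifier = encoder, classifier
        self.decoders = nn.ModuleList([day_decoder, night_decoder])

    def forward(self, image: torch.Tensor):
        feats = self.encoder(image)                # shared image features
        scene_logits = self.classifier(feats)      # (b, 2): day vs. night
        scene = scene_logits.argmax(dim=1)         # route on predicted scene
        masks = [self.decoders[s](f.unsqueeze(0))  # per-sample decoder choice
                 for s, f in zip(scene.tolist(), feats)]
        return scene_logits, torch.cat(masks, dim=0)
```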
- Conference Article
- 10.24963/ijcai.2023/74
- Aug 1, 2023
Semi-supervised semantic segmentation methods are the main solution to the high annotation cost of semantic segmentation. However, class imbalance makes the model favor head classes with sufficient training samples, resulting in poor performance on tail classes. To address this issue, we propose a Decoupled Semi-Supervised Semantic Segmentation (DeS4) framework based on the teacher-student model. Specifically, we first propose a decoupled training strategy that splits the training of the encoder and the segmentation decoder, aiming at a balanced decoder. Then, a non-learnable prototype-based segmentation head is proposed to regularize the consistency of the category representation distribution and better connect the teacher and student models. Furthermore, a Multi-Entropy Sampling (MES) strategy is proposed to collect pixel representations for updating the shared prototypes, yielding a class-unbiased head. We conduct extensive experiments with DeS4 on two challenging benchmarks (PASCAL VOC 2012 and Cityscapes) and achieve remarkable improvements over previous state-of-the-art methods.
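A sketch of a non-learnable prototype head of the kind the abstract mentions: pixels are classified by cosine similarity to per-class prototypes, which are updated by an exponential moving average rather than by gradients. The EMA rate and update rule are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

class PrototypeHead:
    def __init__(self, num_classes: int, dim: int, momentum: float = 0.99):
        self.protos = F.normalize(torch.randn(num_classes, dim), dim=1)
        self.momentum = momentum

    def logits(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (n_pixels, dim) -> cosine similarity to each class prototype
        return F.normalize(feats, dim=1) @ self.protos.t()

    @torch.no_grad()
    def update(self, feats: torch.Tensor, labels: torch.Tensor):
        # non-learnable: prototypes track the running mean of class features
        for c in labels.unique():
            mean = F.normalize(feats[labels == c].mean(0), dim=0)
            self.protos[c] = F.normalize(
                self.momentum * self.protos[c] + (1 - self.momentum) * mean, dim=0)
```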
- Research Article
- 10.1609/aaai.v36i1.19985
- Jun 28, 2022
- Proceedings of the AAAI Conference on Artificial Intelligence
Spatial and channel attentions, modelling the semantic interdependencies in spatial and channel dimensions respectively, have recently been widely used for semantic segmentation. However, computing spatial and channel attentions separately sometimes causes errors, especially for those difficult cases. In this paper, we propose Channelized Axial Attention (CAA) to seamlessly integrate channel attention and spatial attention into a single operation with negligible computation overhead. Specifically, we break down the dot-product operation of the spatial attention into two parts and insert channel relation in between, allowing for independently optimized channel attention on each spatial location. We further develop grouped vectorization, which allows our model to run with very little memory consumption without slowing down the running speed. Comparative experiments conducted on multiple benchmark datasets, including Cityscapes, PASCAL Context, and COCO-Stuff, demonstrate that our CAA outperforms many state-of-the-art segmentation models (including dual attention) on all tested datasets.
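A loose, simplified toy of the idea in the abstract: split spatial attention out = softmax(QK^T)V into its attention-map and aggregation steps, then apply a per-location channel reweighting to the aggregated result. The actual paper inserts the channel relation inside the dot product with grouped vectorization, which this sketch omits; all layer names are assumptions.

```python
import torch
import torch.nn as nn

class ChannelizedSpatialAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Linear(channels, channels)
        self.k = nn.Linear(channels, channels)
        self.channel_gate = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (b, n, c) flattened spatial tokens
        attn = torch.softmax(self.q(x) @ self.k(x).transpose(1, 2)
                             / x.shape[-1] ** 0.5, dim=-1)  # part 1: attention map
        aggregated = attn @ x                               # part 2: aggregation
        # channel relation applied independently at each spatial location
        return aggregated * self.channel_gate(aggregated) + x
```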
- Research Article
- 10.3390/app15042012
- Feb 14, 2025
- Applied Sciences
Effectively classifying each pixel in an image is an important research topic in semantic segmentation. Existing methods typically require the network to directly generate a feature map of the same size as the original image and classify each pixel, which makes it difficult for the network to fully leverage the representations from the backbone. To handle this challenge, this paper proposes mask classification combined with an adaptive dilated convolution network (MCDCNet). First, a Vision Transformer (ViT)-based module is employed as the backbone to capture contextual features. Second, a Spatial Extraction Module (SEM) is proposed to extract multi-scale spatial information through adaptive dilated convolution while preserving the original feature size; this spatial information is then integrated into the corresponding contextual features to enhance the representation. Finally, a novel inference process incorporating an instance activation map (IAM)-based decoder is proposed for semantic segmentation, enhancing the network's capability to capture and comprehend semantic features. Experimental results demonstrate that our network significantly outperforms other per-pixel classification networks across several semantic segmentation datasets. In particular, on Cityscapes, MCDCNet achieves 80.3 mIoU with 11.8 M parameters, delivering strong segmentation performance at a relatively low parameter count.
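One hypothetical reading of "adaptive dilated convolution": run several dilation rates in parallel and let a learned, input-conditioned softmax decide how much each rate contributes, keeping the original feature size throughout. This is an assumption about the mechanism, not the paper's SEM definition.

```python
import torch
import torch.nn as nn

class AdaptiveDilatedConv(nn.Module):
    def __init__(self, channels: int, rates=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r) for r in rates)
        self.select = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, len(rates), 1))

    def forward(self, x):
        weights = torch.softmax(self.select(x), dim=1)         # (b, rates, 1, 1)
        feats = torch.stack([b(x) for b in self.branches], 1)  # (b, rates, c, h, w)
        return (weights.unsqueeze(2) * feats).sum(1) + x       # size-preserving
```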
- Conference Article
- 10.1109/ieeeconf44664.2019.9048981
- Nov 1, 2019
Utilizing channel-wise and spatial attention mechanisms to emphasize informative parts of an input image is an effective way to improve the performance of convolutional neural networks (CNNs). There are multiple effective implementations of the attention mechanism. One is adding squeeze-and-excitation (SE) blocks to the CNN, which exploit channel dependence to selectively emphasize the most informative channels and suppress the less informative ones. Another is adding a convolutional block attention module (CBAM), which implements both channel-wise and spatial attention to select important pixels in the feature maps while emphasizing informative channels. In this paper, we propose an encoder-decoder architecture based on the idea of letting the channel-wise and spatial attention blocks share the same latent-space representation. Instead of separating the channel-wise and spatial attention modules into two independent parts as in CBAM, we combine them into one encoder-decoder architecture with two outputs. To evaluate the proposed algorithm, we apply it to different CNN architectures and test it on image classification and semantic segmentation. Comparing the resulting structures equipped with MEDA blocks against other attention modules, we show that the proposed method achieves better performance across different test scenarios.
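A sketch of the shared-latent idea: one encoder maps the feature tensor to a common latent code, and two decoder heads emit a channel-attention vector and a spatial-attention map from that same latent. Layer sizes are illustrative assumptions, not the MEDA block's exact design.

```python
import torch
import torch.nn as nn

class SharedLatentAttention(nn.Module):
    def __init__(self, channels: int, latent: int = 32):
        super().__init__()
        self.encode = nn.Conv2d(channels, latent, 1)            # shared latent
        self.channel_head = nn.Sequential(                      # -> (b, c, 1, 1)
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(latent, channels, 1), nn.Sigmoid())
        self.spatial_head = nn.Sequential(                      # -> (b, 1, h, w)
            nn.Conv2d(latent, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, x):
        z = self.encode(x)   # both attention outputs decode the same latent
        return x * self.channel_head(z) * self.spatial_head(z)
```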
- Research Article
- 10.1007/s11063-025-11748-8
- Apr 4, 2025
- Neural Processing Letters
Attention mechanisms are critical tools for enhancing the performance of convolutional neural networks (CNNs), focusing on spatial and channel dimensions of feature maps, known as spatial attention and channel attention, respectively. While many advanced attention methods combine these dimensions to improve performance, particularly in downstream computer vision tasks, such methods often introduce significant computational overhead or fail to effectively capture long-range spatial dependencies alongside channel attention. To address these challenges, this paper proposes the sequential fusion attention (SFA) method, which introduces a complementary fusion strategy to integrate spatial and channel attention. Spatial attention leverages strip pooling to model long-range dependencies, while channel attention employs dynamic encoding to refine features. By utilizing a grouped processing approach, the SFA module achieves an optimal balance between computational efficiency and representation power. Extensive experiments on benchmark datasets demonstrate that SFA consistently outperforms state-of-the-art attention mechanisms, delivering competitive accuracy in image classification, object detection, and semantic segmentation tasks while maintaining reduced model complexity. This work underscores the potential of lightweight attention mechanisms in modern computer vision and paves the way for further innovations in resource-efficient neural network design. Our code is publicly available at the following URL: https://github.com/Xuwei86/SFA
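A sketch of strip pooling for long-range spatial attention as the abstract describes: pool along entire rows and entire columns, expand back, and use the sum as a gate so each pixel sees context across its full row and column. The exact fusion with the channel branch is omitted, and the layer layout is an assumption.

```python
import torch
import torch.nn as nn

class StripPoolingAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # row strips: (b, c, h, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # col strips: (b, c, 1, w)
        self.conv_h = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0))
        self.conv_w = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1))

    def forward(self, x):
        b, c, h, w = x.shape
        row = self.conv_h(self.pool_h(x)).expand(b, c, h, w)
        col = self.conv_w(self.pool_w(x)).expand(b, c, h, w)
        return x * torch.sigmoid(row + col)  # long-range gate per pixel
```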
- Research Article
- 10.1109/lsp.2021.3084855
- Jan 1, 2021
- IEEE Signal Processing Letters
Exploiting RGB and depth information can boost the performance of semantic segmentation. However, owing to the differences between RGB images and the corresponding depth maps, such multimodal information should be effectively used and combined. Most existing methods use the same fusion strategy to explore multilevel complementary information at various levels, likely ignoring different feature contributions at various levels for segmentation. To address this problem, we propose a network using a two-stage cascaded decoder (TCD), embedding a detail polishing module, to effectively integrate high- and low-level features and suppress noise from low-level details. Additionally, we introduce a depth filter and fusion module to extract informative regions from depth cues with the guidance of RGB images. The proposed TCD network achieves comparable performance to state-of-the-art RGB-D semantic segmentation methods on the benchmark NYUDv2 and SUN RGB-D datasets.
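A sketch of RGB-guided depth filtering as the abstract outlines: an attention map computed from the RGB features gates the depth features, keeping the depth regions that RGB context marks as informative before fusion. The concrete layer layout is an assumption.

```python
import torch
import torch.nn as nn

class DepthFilterFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.guide = nn.Sequential(nn.Conv2d(channels, 1, 3, padding=1), nn.Sigmoid())
        self.fuse = nn.Conv2d(channels * 2, channels, 1)

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor):
        filtered = depth_feat * self.guide(rgb_feat)   # RGB-guided depth gating
        return self.fuse(torch.cat([rgb_feat, filtered], dim=1))
```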
- Conference Article
- 10.1109/icpeca56706.2023.10075770
- Jan 29, 2023
Airborne lidar can observe the earth quickly and efficiently, actively acquiring three-dimensional information about large areas of ground objects in real time and generating large-scale LiDAR point clouds. However, complex categories and uneven heights of ground objects make semantic segmentation difficult. To overcome this difficulty, this paper proposes a new semantic segmentation network for point clouds based on a spatial position attention mechanism and a deep learning network. The network processes the original point cloud directly, without converting it into groups of 2D feature images or 3D voxel grids, thereby avoiding information loss. The segmentation model uses an encoder-decoder to extract multi-scale features and a multilayer perceptron to achieve high-precision segmentation. Through the spatial position attention mechanism, the network can strengthen or weaken the weights of the convolution kernels to automatically adapt to the spatial structure of point cloud objects. Experiments are carried out on the ISPRS dataset provided by the International Society for Photogrammetry and Remote Sensing; the results indicate that the proposed network can effectively identify various ground objects and achieves higher semantic segmentation accuracy than other popular methods.
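A sketch of spatial-position attention for point clouds in the spirit of the abstract: per-point weights are predicted from the raw XYZ coordinates and rescale the per-point features from a shared MLP, letting the network adapt to local spatial structure. The MLP widths and the exact conditioning are assumptions.

```python
import torch
import torch.nn as nn

class PositionAttentionMLP(nn.Module):
    def __init__(self, in_dim: int = 3, feat_dim: int = 64):
        super().__init__()
        self.feat = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU(),
                                  nn.Linear(feat_dim, feat_dim))
        self.pos_attn = nn.Sequential(nn.Linear(3, feat_dim), nn.Sigmoid())

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (b, n_points, 3) raw coordinates, no voxelization or projection
        return self.feat(xyz) * self.pos_attn(xyz)  # position-gated features
```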
- Book Chapter
- 10.1007/978-3-031-26293-7_25
- Jan 1, 2023
We present a novel SEgmentation TRansformer variant based on causal intervention, serving as an improved vision encoder for semantic segmentation. Many studies have shown that vision transformers (ViT) achieve competitive benchmarks on downstream tasks, indicating that they learn feature representations well; in other words, they are good at observing instances in an image. In the human visual system, however, recognizing objects in a scene requires both observing the objects themselves and introducing prior knowledge to produce higher-confidence results. Inspired by this, we introduce a structural causal model (SCM) over images, category labels, and context. Beyond observation, we propose a causal intervention method that removes the confounding bias of the global context, and we plug it into the ViT encoder. Unlike other sequence-to-sequence prediction tasks, we use causal intervention instead of likelihood. In addition, the proxy training objective of the framework is to predict the contextual objects of a region. Finally, we combine this encoder with the segmentation decoder. Experiments show that our proposed method is flexible and effective.
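For context, a minimal sketch of the standard backdoor adjustment that causal-intervention methods of this kind typically build on (the paper's exact objective may differ): treating the global context $z$ as the confounder,

$$P(Y \mid do(X)) = \sum_{z} P(Y \mid X, z)\, P(z),$$

i.e., the intervention replaces the observational likelihood $P(Y \mid X)$ by averaging the context out at its prior distribution rather than at its image-conditional distribution, removing the confounding path.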