Defects in photovoltaic (PV) panels can significantly reduce the power generation efficiency of a system and may cause localized overheating due to uneven current distribution. Precise pixel-level defect detection, i.e., defect segmentation, is therefore essential to ensuring stable operation. However, effective defect segmentation requires a feature extractor that adaptively determines the appropriate scale or receptive field for accurate defect localization, and a decoder that seamlessly fuses coarse-level semantics with fine-grained features to enhance high-level representations. In this paper, we propose a Progressive Deformable Transformer (PDeT) for defect segmentation in PV cells. The approach learns spatial sampling offsets and refines features progressively through coarse-level semantic attention. Specifically, the network adaptively captures spatial offset positions and computes self-attention over them, expanding the model's receptive field and enabling feature extraction from objects of various shapes. Furthermore, we introduce a semantic aggregation module that refines semantic information by converting the fused feature map into a scale space and balancing contextual information. Extensive experiments demonstrate the effectiveness of our method, which achieves an mIoU of 88.41% on our solar cell dataset and outperforms other methods. Additionally, to validate the applicability of PDeT across domains, we trained and tested it on the MVTec-AD dataset; the results show that PDeT also achieves excellent recognition performance in these other scenarios.
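To make the mechanism referred to above concrete, the sketch below illustrates the general idea of deformable attention with learned spatial sampling offsets: each query predicts a small set of offset locations, values are gathered there by bilinear sampling, and the samples are aggregated with learned attention weights. This is a minimal single-head, single-scale illustration only; the module name, number of sampling points, and layer layout are assumptions for exposition and do not reproduce the PDeT architecture described in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeformableAttentionSketch(nn.Module):
    """Minimal single-head, single-scale deformable attention over a 2-D feature map.

    Illustrative only: each query predicts K sampling offsets around its own
    location, gathers values at those positions via bilinear sampling, and
    aggregates them with learned attention weights.
    """

    def __init__(self, dim: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offset_proj = nn.Linear(dim, 2 * num_points)  # (dx, dy) per sampling point
        self.weight_proj = nn.Linear(dim, num_points)      # attention weight per point
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W); every spatial position acts as a query.
        B, C, H, W = feat.shape
        queries = feat.flatten(2).transpose(1, 2)                     # (B, H*W, C)
        values = self.value_proj(queries).transpose(1, 2).reshape(B, C, H, W)

        # Reference grid in normalized [-1, 1] coordinates (grid_sample convention).
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H, device=feat.device),
            torch.linspace(-1, 1, W, device=feat.device),
            indexing="ij",
        )
        ref = torch.stack((xs, ys), dim=-1).reshape(1, H * W, 1, 2)   # (1, HW, 1, 2)

        # Predict per-query sampling offsets and attention weights.
        offsets = self.offset_proj(queries).reshape(B, H * W, self.num_points, 2)
        weights = self.weight_proj(queries).softmax(dim=-1)           # (B, HW, K)

        # Bilinearly sample values at the offset locations.
        grid = ref + offsets                                          # (B, HW, K, 2)
        sampled = F.grid_sample(values, grid, align_corners=True)     # (B, C, HW, K)

        # Weighted aggregation over the K sampled points, then output projection.
        out = (sampled * weights.unsqueeze(1)).sum(dim=-1)            # (B, C, HW)
        out = self.out_proj(out.transpose(1, 2))                      # (B, HW, C)
        return out.transpose(1, 2).reshape(B, C, H, W)
```

Because the sampling locations are learned rather than fixed to a regular grid, the effective receptive field adapts to the shape of each defect, which is the property the abstract attributes to the proposed network.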