This paper addresses the problem that, in semantic segmentation of traditional art pattern images, existing methods do not fully leverage contextual information. It proposes a multi-level feature aggregation method for image semantic segmentation based on a visual Transformer and coordinate attention adjustment (CA Adjustment). First, the input image is split into patches that are linearly projected, and learnable positional embeddings are added to form the encoded input sequence. This sequence is then fed into a visual Transformer-based encoder, which models global context across the entire network. Next, CA Adjustment adaptively merges the fused features with the visual backbone features through weighted adjustment. Finally, a multi-objective grasshopper optimization algorithm based on a clustering evolution mechanism is designed to further improve performance. Extensive experiments demonstrate that our proposed algorithm effectively models global contextual information for image feature extraction. Compared with existing state-of-the-art algorithms, our method achieves high segmentation accuracy in semantic segmentation tasks, consistently exceeding 95%. Ablation experiments further validate the efficacy of our approach, underscoring its broad applicability to high-precision image semantic segmentation.
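To make the described pipeline concrete, the following is a minimal, hypothetical sketch of its three architectural steps: patch embedding with learnable positional embeddings, a Transformer encoder for global context, and a coordinate-attention-style weighted fusion of encoder features with backbone features. All class names, layer sizes, and hyperparameters here are illustrative assumptions, not the authors' implementation, and the grasshopper optimization stage is omitted.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split the image into patches, linearly project each patch,
    and add learnable positional embeddings (hypothetical sizes)."""
    def __init__(self, img_size=224, patch_size=16, in_ch=3, dim=256):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided conv is equivalent to slicing + linear projection.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, dim))

    def forward(self, x):
        x = self.proj(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        return x + self.pos

class CoordAttentionFusion(nn.Module):
    """Coordinate-attention-style gate: pool along height and width to
    encode positional information, then reweight backbone features
    before merging them with the Transformer features (assumed design)."""
    def __init__(self, dim):
        super().__init__()
        self.conv_h = nn.Conv2d(dim, dim, kernel_size=1)
        self.conv_w = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, transformer_feat, backbone_feat):
        # Axis-wise pooling captures coordinate (direction-aware) context.
        h_att = torch.sigmoid(self.conv_h(transformer_feat.mean(dim=3, keepdim=True)))
        w_att = torch.sigmoid(self.conv_w(transformer_feat.mean(dim=2, keepdim=True)))
        # Weighted adjustment: gate the backbone features, then merge.
        return transformer_feat + backbone_feat * h_att * w_att

# Usage sketch: tokens -> ViT encoder -> reshape to a feature map -> fusion.
embed = PatchEmbed()
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=4,
)
fuse = CoordAttentionFusion(dim=256)

img = torch.randn(1, 3, 224, 224)
tokens = encoder(embed(img))                        # (1, 196, 256)
feat = tokens.transpose(1, 2).reshape(1, 256, 14, 14)
backbone_feat = torch.randn(1, 256, 14, 14)         # stand-in CNN features
out = fuse(feat, backbone_feat)                     # (1, 256, 14, 14)
```

In this reading, the Transformer branch supplies global context while the gated backbone branch retains local spatial detail; a segmentation head would then upsample the fused map to per-pixel class logits.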