A Deformable Attention Network for High-Resolution Remote Sensing Images Semantic Segmentation

Renxiang Zuo, Guangyun Zhang, Xiuping Jia, Rongting Zhang

doi:10.1109/tgrs.2021.3119537

Abstract

Deformable convolutional networks (DCNs) can mitigate the inherently limited ability of standard convolutions to model geometric transformations. In this article, we reformulate the spatial-wise attention mechanism using DCNs for semantic segmentation of high-resolution remote sensing (HRRS) images. The resulting deformable attention module (DAM) combines a sparse spatial sampling strategy with long-range relationship modeling capability. This locality awareness is more adaptable to HRRS image structures and captures each pixel’s neighboring structural information. Based on the proposed DAM, a multiscale deformable attention net (MDANet) is designed for HRRS image semantic segmentation at a slightly increased computational cost. Specifically, standard convolutional layers in the original ResNet50 are equipped with a DAM to control sampling over a broader range of feature levels and to aggregate multiscale context information. Experimental results on the Vaihingen and DeepGlobe Land Cover Classification datasets show that MDANet improves on the backbone network (ResNet50) by 7.77% and 8.45%, respectively, in terms of mean intersection over union (mIoU). Furthermore, a DAM can perform better than a global spatial attention mechanism with less computation on a $3 \times 64 \times 64$ feature map. In addition, ablation studies demonstrate the effectiveness and efficiency of the DAM and the multiscale strategy. Finally, the sensitivity of critical hyperparameters is analyzed.
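To give a rough sense of why sparse sampling can need less computation than global spatial attention on a 64 × 64 feature map, the following sketch counts pairwise affinities only. It is an illustrative back-of-the-envelope calculation, not the paper's cost analysis: the function name `attention_pair_counts` and the choice of 9 sampled positions per pixel are assumptions made here, and the cost of predicting sampling offsets is ignored.

```python
def attention_pair_counts(height, width, k_samples):
    """Count query-key affinities for an H x W feature map.

    Global spatial attention relates every pixel to every other pixel.
    A sparse scheme relates each pixel to only k_samples positions.
    k_samples is a hypothetical value, not taken from the paper.
    """
    n = height * width            # number of spatial positions
    global_pairs = n * n          # every position attends to all positions
    sparse_pairs = n * k_samples  # every position attends to k sampled ones
    return global_pairs, sparse_pairs


# 64 x 64 spatial grid, as in the feature map mentioned in the abstract;
# 9 samples per pixel is an assumed value (e.g. a 3 x 3 sampling pattern).
global_pairs, sparse_pairs = attention_pair_counts(64, 64, k_samples=9)
print(global_pairs)   # 16777216
print(sparse_pairs)   # 36864
```

Under these assumptions the number of affinities grows quadratically with the number of pixels for global attention but only linearly for sparse sampling.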

Full Text