Abstract

To enhance the capability of neural networks, research on attention mechanisms has deepened. In this area, attention modules perform forward inference along the channel and spatial dimensions sequentially, in parallel, or simultaneously. However, we find that spatial attention modules mainly apply convolution layers to generate attention maps, which aggregate feature responses only within local receptive fields. In this article, we exploit this finding to create a nonlocal spatial attention module (NL-SAM), which collects context information from all pixels to adaptively recalibrate the spatial responses of a convolutional feature map. NL-SAM overcomes the limitation of repeated local operations and exports a 2D spatial attention map that emphasizes or suppresses responses at different locations. Experiments on three benchmark datasets show improvements of at least 0.58% on variant ResNets. Furthermore, the module is simple and can easily be integrated with existing channel attention modules, such as squeeze-and-excitation and gather-excite, surpassing these influential models at a minimal additional computational cost (0.196%).

Introduction

By interleaving a series of convolutional layers with nonlinear activation functions and downsampling operators, convolutional neural networks (CNNs)[1] are able to produce robust representations that capture hierarchical patterns and attain a global theoretical receptive field. CNNs have become the paradigm of choice in many computer vision applications, such as image classification,[2,3,4,5] object detection,[6] semantic segmentation,[7] and regression.[8,9] In recent years, attention mechanisms have emerged as a remedy for feature recalibration by capturing long-range contextual interactions. The self-attention mechanism measures the pairwise compatibility between query and key contents. In this field, one approach, the nonlocal network (NLNet),[12] computes a self-attention map that models the correspondence from all positions to each query position. Our nonlocal spatial attention module (NL-SAM) combines the strength of NLNet, effective modeling of global contextual information, with that of the convolutional block attention module (CBAM), efficient generation of a spatial attention map; a sketch of this idea follows below. The second section reviews related work on attention mechanisms.
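Although the article defines NL-SAM formally in later sections, the idea described here can be sketched in a few lines of PyTorch. The sketch below is a minimal, hypothetical reading of that description: 1×1 convolutions produce per-pixel queries and keys, a softmax-normalized affinity matrix lets every location aggregate context from all pixels, and the result collapses into a single-channel 2D map that rescales the input. The class name NonlocalSpatialAttention, the reduction parameter, and the sigmoid gating are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class NonlocalSpatialAttention(nn.Module):
    """Hypothetical NL-SAM-style module: a 2D spatial attention map whose
    value at each location is aggregated from ALL pixels rather than a
    local convolutional neighborhood."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        inner = max(channels // reduction, 1)
        self.query = nn.Conv2d(channels, inner, kernel_size=1)  # per-pixel queries
        self.key = nn.Conv2d(channels, inner, kernel_size=1)    # per-pixel keys
        self.value = nn.Conv2d(channels, 1, kernel_size=1)      # per-pixel importance score

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.key(x).flatten(2)                     # (B, C', HW)
        # Affinity from every query position to every key position,
        # normalized over all HW pixels: a global receptive field.
        affinity = torch.softmax(q @ k, dim=-1)        # (B, HW, HW)
        v = self.value(x).flatten(2).transpose(1, 2)   # (B, HW, 1)
        # Each location's attention value is a weighted sum over all pixels.
        attn = (affinity @ v).transpose(1, 2).reshape(b, 1, h, w)
        # The sigmoid-gated 2D map emphasizes or suppresses responses per location.
        return x * torch.sigmoid(attn)

# Usage: the output keeps the input's shape, so the module can be dropped
# after any convolutional block.
x = torch.randn(2, 64, 56, 56)
out = NonlocalSpatialAttention(64)(x)
```

Placing such a module after a residual block, possibly alongside a channel attention module such as squeeze-and-excitation, is consistent with the integration described in the abstract.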
