Abstract

Despite the great success of attention mechanisms in object recognition, scene recognition remains a challenging problem because discriminative regions are rarely evident in a scene image. For example, a tree can serve as a cue for recognizing a scene, but it cannot be the only cue, since several scene categories (e.g., mountain, marsh, and river) may contain trees. Thus, scene recognition sometimes requires attending to the image as a whole rather than to specific regions. To address this problem, we propose the Spatial-Channel Transformer (SC-Transformer), a simple yet effective module whose attention mechanism weighs the relative importance of the spatial and channel domains for a given scene image. If the image should be considered only within specific regions, the SC-Transformer suppresses channel attention, and vice versa. Furthermore, our attention mechanism advances over previous approaches: prior spatial and channel attention mechanisms were designed in a sequential or parallel manner, and because they ultimately combine spatial and channel attention, the two may interfere with each other. In contrast, we present a mechanism that considers spatial and channel attention simultaneously. We validate our approach on a large-scale scene recognition dataset and outperform the previous state-of-the-art spatial-channel attention mechanism. Experimental results demonstrate the efficacy of our attention mechanism for scene recognition.
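To make the idea concrete, the sketch below shows one way a gated spatial-channel attention module could be structured in PyTorch: both attention branches are computed from the same features, and a learned gate decides how much each domain contributes, so that one branch can be effectively turned off. This is only an illustrative assumption based on the abstract, not the authors' implementation; the module name `SCAttention`, the gating MLP, and the fusion rule are all hypothetical.

```python
# Illustrative sketch (assumption, not the authors' code): spatial and
# channel attention computed simultaneously, blended by a learned gate.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SCAttention(nn.Module):  # hypothetical module name
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel branch: global pooling followed by a bottleneck MLP.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Spatial branch: convolution over pooled per-location statistics.
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        # Gate: predicts how much each domain matters for this image.
        self.gate = nn.Linear(channels, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        pooled = x.mean(dim=(2, 3))                            # (B, C)

        # Channel attention: per-channel weights in [0, 1].
        ca = torch.sigmoid(self.channel_mlp(pooled)).view(b, c, 1, 1)

        # Spatial attention: per-location weights in [0, 1].
        stats = torch.cat(
            [x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1
        )                                                      # (B, 2, H, W)
        sa = torch.sigmoid(self.spatial_conv(stats))           # (B, 1, H, W)

        # Softmax gate over {spatial, channel}; a near-zero weight
        # effectively switches that branch off for the given image.
        g = F.softmax(self.gate(pooled), dim=-1)               # (B, 2)
        g_sp = g[:, 0].view(b, 1, 1, 1)
        g_ch = g[:, 1].view(b, 1, 1, 1)

        # Both attentions are applied in one simultaneous step,
        # rather than sequentially as in earlier spatial-channel modules.
        attn = g_sp * sa + g_ch * ca                           # (B, C, H, W)
        return x * attn
```

In this sketch the softmax gate plays the role described in the abstract: when one domain dominates, the other's contribution is suppressed, avoiding the interference that can arise when sequential or parallel designs always combine both attentions.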
