Spatial-Channel Collaborated Attention for Cross-Scale Crowd Counting

Abstract

Large-scale variations in crowd images pose significant challenges to crowd counting. Recently, vision Transformers have been widely adopted to extract spatial cross-scale features from crowd images. However, the interaction of features in the channel dimension is frequently overlooked, despite its potential benefit for learning cross-scale features. In this paper, we introduce a Spatial-Channel Collaborated Attention Network (SCCAN) to address the challenge of large-scale variations in crowd counting. Specifically, given feature maps of a crowd image, we exploit the inter-channel correlation of the features through parallel Channel Pooling Attention (CPA) and Channel Self-Attention (CSA). We then use the proposed Scale-aware Window Self-Attention (SWSA) to learn cross-scale spatial features. The extracted features are fed into a convolutional regression head to predict the crowd density map. We conduct extensive experiments on public crowd counting datasets including ShanghaiTech A, UCF-QNRF, and JHU++. The experimental results demonstrate the superiority of our method.
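The pipeline described above (parallel channel-attention branches, windowed spatial self-attention, then density regression) can be sketched in NumPy. This is a minimal illustrative sketch, not the paper's implementation: the exact designs of CPA, CSA, and the scale-aware component of SWSA are not specified in the abstract, so the pooling rule, token layout, and the stand-in regression head below are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_pooling_attention(x):
    """CPA sketch: global-average-pool each channel into a scalar and
    reweight channels (squeeze-and-excite style; hypothetical design)."""
    w = softmax(x.mean(axis=(1, 2)))            # (C,) per-channel weights
    return x * w[:, None, None]

def channel_self_attention(x):
    """CSA sketch: self-attention across channels, treating each
    channel's flattened spatial map as one token."""
    C = x.shape[0]
    tokens = x.reshape(C, -1)                   # (C, H*W)
    attn = softmax(tokens @ tokens.T / np.sqrt(tokens.shape[1]), axis=-1)
    return (attn @ tokens).reshape(x.shape)     # channel-mixed features

def window_self_attention(x, win=2):
    """SWSA sketch: plain self-attention within non-overlapping spatial
    windows; the paper's scale-aware mechanism is omitted here."""
    C, H, W = x.shape
    out = np.empty_like(x)
    for i in range(0, H, win):
        for j in range(0, W, win):
            patch = x[:, i:i+win, j:j+win].reshape(C, -1).T  # (win*win, C)
            attn = softmax(patch @ patch.T / np.sqrt(C), axis=-1)
            out[:, i:i+win, j:j+win] = (attn @ patch).T.reshape(C, win, win)
    return out

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 4, 4))           # toy feature map (C, H, W)
feat = channel_pooling_attention(feat) + channel_self_attention(feat)  # parallel branches
feat = window_self_attention(feat)
density = np.maximum(feat.mean(axis=0), 0.0)    # stand-in for the conv regression head
count = float(density.sum())                    # predicted count = integral of density map
```

In the full model the attention branches carry learned projections and the regression head is a stack of convolutions; here they are replaced by parameter-free stand-ins so the data flow, feature shapes, and the count-as-integral convention are visible.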
