Abstract

Multi-modal crowd counting aims to use multiple types of data, such as RGB-Thermal and RGB-Depth, to count the number of people in crowded scenes. Current methods mainly focus on two-stream multi-modal information fusion in the encoder and single-scale semantic features in the decoder. In this paper, we propose an end-to-end three-stream fusion and self-differential attention network to simultaneously address the multi-modal fusion and scale variation problems in multi-modal crowd counting. Specifically, the encoder adopts three-stream fusion to combine stage-wise modality-paired and modality-specific features. The decoder applies a self-differential attention mechanism to multi-level fused features to adaptively extract basic and differential information, and a counting head then predicts the density map. Experimental results on RGB-T and RGB-D benchmarks show the superiority of our proposed method over state-of-the-art multi-modal crowd counting methods. Ablation studies and visualizations demonstrate the advantages of the proposed modules.
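To make the described pipeline concrete, the following is a minimal PyTorch sketch of the three components named in the abstract (three-stream fusion, self-differential attention, counting head). It is not the authors' implementation: the module names, channel sizes, the 1x1-convolution fusion, and the particular way "basic" and "differential" information are separated are all illustrative assumptions.

```python
# Minimal sketch of the described pipeline, not the authors' code.
# Module names, channel sizes, and the fusion/attention details are assumptions.
import torch
import torch.nn as nn

class ThreeStreamFusion(nn.Module):
    """Fuse modality-specific RGB/auxiliary features with a shared (modality-paired) stream."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat, aux_feat, shared_feat):
        return self.fuse(torch.cat([rgb_feat, aux_feat, shared_feat], dim=1))

class SelfDifferentialAttention(nn.Module):
    """Reweight fused features by their deviation from a smoothed (basic) version."""
    def __init__(self, channels):
        super().__init__()
        self.basic = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, kernel_size=1), nn.Sigmoid())

    def forward(self, x):
        basic = self.basic(x)
        diff = x - basic                       # differential information
        return basic + self.gate(diff) * diff  # adaptively re-inject differences

class CountingHead(nn.Module):
    """Predict a single-channel density map; the crowd count is its spatial sum."""
    def __init__(self, channels):
        super().__init__()
        self.head = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):
        return torch.relu(self.head(x))

# Toy usage: fuse one encoder stage and decode it into a density map.
c = 64
fusion, attn, head = ThreeStreamFusion(c), SelfDifferentialAttention(c), CountingHead(c)
rgb, thermal, shared = (torch.randn(1, c, 32, 32) for _ in range(3))
density = head(attn(fusion(rgb, thermal, shared)))
count = density.sum().item()
```

In the full model this fusion would be applied at every encoder stage and the attention to the resulting multi-level features; the toy usage above shows only a single stage for brevity.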
