Abstract

RGB-T crowd counting uses thermal data to compensate for insufficient RGB feature representation under low-illumination conditions. However, efficiently exploiting the complementary information of the two modalities remains challenging. Existing CNN-based RGB-T crowd counting methods model cross-modal feature representation with local, fixed-receptive-field convolutions, while transformer-based works mainly rely on mutual attention; neither achieves global cross-modal feature representation. To this end, this paper designs a multimodal counting network built on a multimodal transformer mixer to realize cross-modal collaborative feature representation. First, we design a Transformer-based multimodal mixer to fully fuse the features of both modalities. Then, we build a two-stream backbone from a Transformer-based network in which the multi-head self-attention layer is replaced by an average pooling layer, extracting rich multi-scale features for each modality. Meanwhile, the features produced by the mixers enhance the intermediate features, preserving crowd detail and suppressing background information. Finally, we design a pyramid regression head that aggregates the multi-scale feature maps to regress the density maps. Extensive experiments and ablation studies on the challenging RGB-T crowd counting benchmarks RGBT-CC and DroneRGBT yield competitive results, demonstrating the effectiveness of the proposed cross-modal collaborative feature representation for the target task.
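
A minimal sketch of the backbone block described above, assuming a PoolFormer-style design in which the multi-head self-attention layer is replaced by average pooling; the class name PoolingBlock and the hyperparameters pool_size and mlp_ratio are illustrative and not taken from the paper:

```python
import torch
import torch.nn as nn

class PoolingBlock(nn.Module):
    """Transformer-style block whose token mixer is average pooling (sketch)."""
    def __init__(self, dim, pool_size=3, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)  # channel-wise LayerNorm for NCHW feature maps
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2,
                                 count_include_pad=False)
        self.norm2 = nn.GroupNorm(1, dim)
        hidden = dim * mlp_ratio
        self.mlp = nn.Sequential(          # channel MLP via 1x1 convolutions
            nn.Conv2d(dim, hidden, 1), nn.GELU(), nn.Conv2d(hidden, dim, 1))

    def forward(self, x):                  # x: (B, C, H, W) feature map
        y = self.norm1(x)
        x = x + (self.pool(y) - y)         # pooling token mixer (identity removed)
        x = x + self.mlp(self.norm2(x))    # feed-forward refinement
        return x

# Usage: one such block per modality stream (RGB or thermal) in the two-stream backbone.
feats = torch.randn(2, 64, 32, 32)
print(PoolingBlock(64)(feats).shape)       # torch.Size([2, 64, 32, 32])
```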
