Abstract

To reduce the interference of arbitrary crowd distributions and complex backgrounds on counting accuracy in unconstrained scenes, this paper presents a novel RGB-T crowd counting approach based on cross-modal discriminative feature representation learning, which consists of two main stages. Specifically, the first stage explicitly establishes a non-linear mapping from the RGB domain to the thermal domain in order to learn a cross-modal discriminative feature representation of the crowd distribution, since the thermal image of a crowded scene conveys more intuitive crowd-related information and is less sensitive to background elements than the conventional optical image. The second stage fuses the complementary features of the RGB-thermal image pair, building on the feature representation learned in the first stage, to yield the final counting result. Experiments on the RGB-T crowd counting benchmark verify the superiority of the proposed approach over state-of-the-art methods. Ablation studies on RGB-T data and evaluations on RGB crowd counting benchmarks validate the effectiveness of the designed cross-modal discriminative feature representation learning. The experimental results further demonstrate that, unlike conventional works that treat the paired RGB-thermal images as indiscriminate information sources and fuse them directly, the proposed approach obviates the need to rely on specific modality-specific feature extraction structures. By explicitly establishing the cross-modal domain mapping, the proposed approach realizes cross-modal discriminative feature representation learning in an efficient way, benefiting algorithm research for intelligent surveillance system development.
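The two-stage pipeline described in the abstract can be sketched at a high level as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names (`map_rgb_to_thermal`, `fuse_and_count`), the toy feature dimensions, and the random linear layers are all assumptions introduced here to show the flow of stage 1 (RGB-to-thermal mapping) into stage 2 (complementary fusion and counting).

```python
import numpy as np

rng = np.random.default_rng(0)

def map_rgb_to_thermal(rgb_feat, W1, W2):
    """Stage 1 (hypothetical sketch): non-linear mapping from RGB
    features to the thermal domain, yielding pseudo-thermal features."""
    hidden = np.maximum(rgb_feat @ W1, 0.0)   # ReLU non-linearity
    return hidden @ W2

def fuse_and_count(rgb_feat, thermal_feat, pseudo_thermal, w):
    """Stage 2 (hypothetical sketch): fuse complementary RGB and thermal
    features with the learned representation, then sum a density map."""
    fused = np.concatenate([rgb_feat, thermal_feat, pseudo_thermal], axis=-1)
    density = np.maximum(fused @ w, 0.0)      # non-negative density values
    return float(density.sum())               # predicted crowd count

# Toy feature maps: H*W spatial locations, C channels each.
H, W, C = 4, 4, 8
rgb = rng.standard_normal((H * W, C))
thermal = rng.standard_normal((H * W, C))
W1 = rng.standard_normal((C, C))
W2 = rng.standard_normal((C, C))
w = rng.standard_normal((3 * C,))

pseudo = map_rgb_to_thermal(rgb, W1, W2)
count = fuse_and_count(rgb, thermal, pseudo, w)
```

In a real system the random matrices would be learned network weights and the density map would be supervised against ground-truth annotations; the sketch only conveys how the stage-1 output feeds the stage-2 fusion.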
