Abstract

Cross-modal crowd counting methods adapt better to complex scenes by introducing supplementary information from an additional modality. However, existing methods still suffer from insufficient fusion of modal features, underutilization of crowd structure, and neglect of scale information. To address these issues, this paper proposes a cross-modal multi-scale perception network (CMPNet). CMPNet consists mainly of a cross-modal perception fusion module and a multi-scale feature aggregation module. The cross-modal perception fusion module suppresses noisy features while sharing features between modalities, thereby significantly improving the robustness of crowd counting. The multi-scale feature aggregation module captures rich crowd structure information through a spatial context-aware graph convolution unit and then integrates features from different scales to enhance the network’s perception of crowd density. To the best of our knowledge, CMPNet is the first attempt to model crowd structure and mine its semantics in cross-modal crowd counting. Experimental results show that CMPNet achieves state-of-the-art performance on all RGB-T datasets, providing an effective solution for cross-modal crowd counting. We will release the code at https://github.com/KunChenKKK/CMPNet.
