Abstract

Combining visible and thermal images has been shown to improve crowd counting accuracy in illumination-unconstrained scenes. However, misalignment within RGB-T image pairs has not been extensively explored in this context. This study addresses that misalignment to improve the counting accuracy of cross-modal models. Specifically, we propose CrowdAlign, a cross-modal feature alignment fusion network that uses a shared-weight strategy for efficient feature extraction. CrowdAlign performs alignment in two stages: pre-fusion and post-fusion. For the pre-fusion stage, we design a dual-level spatial-semantic parallel alignment module; for the post-fusion stage, we develop a low-frequency feature attention filtering module. This two-stage approach enables cross-modal feature alignment without requiring additional supervision. Experiments on public benchmarks demonstrate that our method is effective under RGB-T misalignment or in dark conditions. We hope CrowdAlign will inspire researchers to explore misalignment between RGB and thermal images in cross-modal crowd counting.
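
The shared-weight strategy mentioned above can be illustrated with a minimal sketch, in which one set of convolution weights encodes both modalities. This is not the CrowdAlign implementation: the names `SharedEncoder`, `conv2d_same`, and `extract_pair`, the single-layer design, the channel counts, and the assumption that the thermal image is stored with three channels are all hypothetical choices made for illustration.

```python
import numpy as np


def conv2d_same(x, kernels):
    """Naive zero-padded 'same' 2-D cross-correlation followed by ReLU.

    x:       array of shape (C_in, H, W)
    kernels: array of shape (C_out, C_in, k, k), with odd k
    """
    c_out, c_in, k, _ = kernels.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, w = x.shape
    out = np.zeros((c_out, h, w))
    for i in range(k):
        for j in range(k):
            patch = xp[:, i:i + h, j:j + w]  # shifted view, (C_in, H, W)
            out += np.einsum("oc,chw->ohw", kernels[:, :, i, j], patch)
    return np.maximum(out, 0.0)


class SharedEncoder:
    """Hypothetical one-layer encoder; its weights are reused for every input."""

    def __init__(self, c_in=3, c_out=8, k=3, seed=0):
        rng = np.random.default_rng(seed)
        self.kernels = rng.standard_normal((c_out, c_in, k, k)) * 0.1

    def __call__(self, x):
        return conv2d_same(x, self.kernels)


def extract_pair(encoder, rgb, thermal):
    # The same encoder (same weights) processes both modalities.
    return encoder(rgb), encoder(thermal)


rng = np.random.default_rng(1)
rgb = rng.random((3, 16, 16))      # stand-in RGB image
thermal = rng.random((3, 16, 16))  # stand-in thermal image (assumed 3-channel)
encoder = SharedEncoder()
f_rgb, f_t = extract_pair(encoder, rgb, thermal)
```

Because both branches read the same `kernels` array, the parameter count does not grow with the number of modalities, which is the efficiency argument usually made for weight sharing. Alignment and fusion would operate on `f_rgb` and `f_t` and are not sketched here.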
