Abstract

Although recent cross-modal crowd counting methods have achieved promising performance, most of them focus only on how to combine the RGB and thermal modalities and ignore the adverse effect of scale variation on cross-modal feature fusion. Scale variation has been extensively investigated in unimodal crowd counting but remains less studied in RGB-T crowd counting. This paper therefore explores aggregated multi-scale context in both modality-specific and modality-shared feature extraction. Specifically, we first introduce a novel way of combining multi-scale analysis with attention aggregation to obtain a deeply aggregated scale-aware feature representation. Based on this combination, we design the Scale-aware Channel Attention Aggregation (SCAA) module for modality-specific feature extraction and the Scale-aware Cross-modal Feature Aggregation (SCFA) module for cross-modal feature fusion. Finally, we build our cross-modal crowd counting architecture on the SCAA and SCFA modules. Importantly, the backbone of the counting model can be replaced by other structures, so SCAA and SCFA are plug-and-play modules. Extensive experiments on challenging RGB-T and RGB-D crowd counting benchmarks demonstrate that the proposed method achieves state-of-the-art RGB-T counting performance and can also be extended to the RGB-D crowd counting task.

