Abstract

Cross-modal crowd counting aims to exploit the complementary information between different modalities to generate crowd density maps and thus estimate the number of pedestrians more accurately in unconstrained scenes. Because of the large differences between images of different modalities, effectively fusing cross-modal information remains a challenging problem. To address this problem, we propose a cross-modal crowd counting method that combines a CNN with a novel cross-modal transformer, which effectively fuses information across modalities and boosts counting accuracy in unconstrained scenes. Concretely, we first design two CNN branches to capture the modality-specific features of the input images. We then design a novel cross-modal transformer to extract cross-modal global features from these modality-specific features. Furthermore, we propose a cross-layer connection structure that links the front-end and back-end information of the network by adding features from different layers. At the end of the network, we develop a cross-modal attention module that strengthens the cross-modal feature representation by extracting the complementarities between the different modal features. The experimental results show that the proposed method combining a CNN and a novel cross-modal transformer achieves state-of-the-art performance: it not only improves the accuracy and robustness of cross-modal crowd counting but also generalizes well to multimodal crowd counting.
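
To make the described pipeline concrete, the following is a minimal sketch of the overall architecture in PyTorch: two modality-specific CNN branches, a cross-modal attention step feeding a transformer encoder for global fusion, an additive cross-layer connection, and a density regression head. The layer sizes, module depths, and the specific modality pairing (RGB plus an auxiliary modality such as thermal or depth) are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of the described pipeline: dual CNN branches, cross-modal attention,
# transformer-based global fusion, additive cross-layer connection, and a
# density-map head. All dimensions and layer choices are assumptions.
import torch
import torch.nn as nn


class CrossModalCounter(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        # Two modality-specific CNN branches (e.g. RGB and thermal/depth).
        self.branch_rgb = nn.Sequential(
            nn.Conv2d(3, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU())
        self.branch_aux = nn.Sequential(
            nn.Conv2d(3, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU())
        # Cross-modal attention: tokens of one modality attend to the other.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Transformer encoder extracts cross-modal global features.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, dim_feedforward=2 * dim,
                                       batch_first=True), num_layers=2)
        # 1x1 convolution regresses the crowd density map.
        self.head = nn.Conv2d(dim, 1, 1)

    def forward(self, rgb, aux):
        f_rgb = self.branch_rgb(rgb)              # modality-specific features
        f_aux = self.branch_aux(aux)
        b, c, h, w = f_rgb.shape
        t_rgb = f_rgb.flatten(2).transpose(1, 2)  # (B, H*W, C) tokens
        t_aux = f_aux.flatten(2).transpose(1, 2)
        # Cross-modal attention: RGB tokens query the auxiliary modality.
        fused, _ = self.cross_attn(t_rgb, t_aux, t_aux)
        fused = self.encoder(fused)               # cross-modal global features
        fused = fused.transpose(1, 2).reshape(b, c, h, w)
        # Cross-layer connection: add front-end CNN features to back-end ones.
        fused = fused + f_rgb + f_aux
        return self.head(fused)                   # predicted density map


if __name__ == "__main__":
    model = CrossModalCounter()
    rgb = torch.randn(1, 3, 64, 64)
    aux = torch.randn(1, 3, 64, 64)
    density = model(rgb, aux)
    count = density.sum().item()                  # estimated pedestrian count
    print(density.shape, count)
```

Summing the predicted density map yields the crowd count; the additive cross-layer connection here simply adds the front-end CNN features to the fused back-end features, mirroring the connection scheme described in the abstract.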
