Abstract

Efficient feature representation is key to improving crowd counting performance. CNNs and Transformers are the two commonly used feature extraction frameworks in crowd counting. A CNN excels at hierarchically extracting local features to obtain a multi-scale representation of the image, but it struggles to capture global features. The Transformer, on the other hand, can capture global feature representations by using cascaded self-attention to model long-range dependencies, but it often overlooks local detail. Relying solely on either a CNN or a Transformer for crowd counting therefore has limitations. In this paper, we propose TCHNet, a crowd counting model that combines the CNN and Transformer frameworks. The model employs the CMT (CNNs Meet Vision Transformers) backbone as the Feature Extraction Module (FEM), hierarchically extracting local and global crowd features through a combination of convolution and self-attention. To obtain more comprehensive local spatial information, an improved Progressive Multi-scale Learning Process (PMLP) is introduced into the FEM, guiding the network to learn at three granularity levels. The features from these three granularity levels are then fed into the Multi-scale Feature Aggregation Module (MFAM) for fusion. Finally, a Multi-Scale Regression Module (MSRM) processes the fused multi-scale features, yielding crowd features rich in both high-level semantics and low-level detail. Experimental results on five benchmark datasets demonstrate that TCHNet achieves highly competitive performance compared with popular crowd counting methods.
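
To make the hybrid design concrete, the sketch below illustrates the general idea described in the abstract: a block that combines a convolutional branch (local detail) with self-attention (global context), a three-level feature pyramid standing in for the multi-granularity learning, a simple concatenation-based fusion standing in for the MFAM, and a 1x1 convolution regression head standing in for the MSRM. All module names, channel widths, and strides here are illustrative assumptions, not the paper's actual CMT/PMLP/MFAM/MSRM implementations.

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Illustrative local + global feature block (not the paper's exact CMT block)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        # Local branch: depthwise convolution captures fine spatial detail.
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        # Global branch: multi-head self-attention over flattened spatial tokens.
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, h, w = x.shape
        local = self.local(x)                    # local detail features
        tokens = self.norm(x.flatten(2).transpose(1, 2))   # (B, H*W, C)
        glob, _ = self.attn(tokens, tokens, tokens)
        glob = glob.transpose(1, 2).reshape(b, c, h, w)
        return x + local + glob                  # fuse local and global cues

class TinyCounter(nn.Module):
    """Toy three-level pyramid -> fused features -> density map (hypothetical)."""
    def __init__(self, dim=32):
        super().__init__()
        self.stem = nn.Conv2d(3, dim, 3, stride=2, padding=1)
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(dim, dim, 3, stride=2, padding=1), HybridBlock(dim))
            for _ in range(3)                    # three granularity levels
        ])
        self.fuse = nn.Conv2d(dim * 3, dim, 1)   # naive stand-in for the MFAM
        self.head = nn.Conv2d(dim, 1, 1)         # density regression stand-in for the MSRM

    def forward(self, x):
        x = self.stem(x)
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        # Upsample coarser levels to the finest level's resolution before fusion.
        target = feats[0].shape[-2:]
        feats = [nn.functional.interpolate(f, size=target, mode="bilinear",
                                           align_corners=False) for f in feats]
        density = self.head(self.fuse(torch.cat(feats, dim=1)))
        return density, density.sum(dim=(1, 2, 3))   # density map and predicted count

# Usage: predict a density map and crowd count for a random image.
img = torch.randn(1, 3, 256, 256)
dmap, count = TinyCounter()(img)
```

In an actual model the fusion and regression stages would be considerably more elaborate (e.g. scale-aware weighting and multi-branch regression), but the sketch shows how local convolutional features and global attention features can be combined at several resolutions before regressing a density map whose integral gives the count.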
