Abstract
Crowd localization provides both the positions of individuals and the total number of people in a scene, which is of great value for security monitoring and public management, yet it faces challenges such as lighting variation, occlusion, and perspective distortion. Recently, Transformers have been applied to crowd localization to address these challenges. However, such methods integrate multi-scale information only once, resulting in incomplete multi-scale fusion. In this paper, we propose a novel Transformer network, the Cross-scale Vision Transformer (CsViT), for crowd localization, which fuses multi-scale information in both the encoder and decoder stages while modeling long-range context dependencies on the combined feature maps. To this end, we design a multi-scale encoder that fuses feature maps of multiple scales at corresponding positions to obtain the combined feature maps, and a multi-scale decoder that integrates tokens at multiple scales when modeling long-range context dependencies. Furthermore, we propose a Multi-scale SSIM (MsSSIM) loss that adaptively computes head regions and optimizes similarity at multiple scales. Specifically, we set adaptive windows of different scales for each head and compute the loss within these windows, which improves the accuracy of the predicted distance transform map. Comprehensive experiments on five public datasets validate the effectiveness of our method.
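To make the multi-scale SSIM idea concrete, the following is a minimal PyTorch sketch, not the paper's exact formulation: the abstract describes per-head adaptive windows, whose sizing rule is not specified here, so this sketch instead averages (1 - SSIM) over a few fixed window scales and uses a uniform rather than Gaussian window. It assumes single-channel distance transform maps of shape (N, 1, H, W); the function names and window sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, window_size):
    """Mean SSIM between two single-channel maps, using a uniform
    (average-pooling) window of the given odd size."""
    C1, C2 = 0.01 ** 2, 0.03 ** 2  # standard SSIM stabilizing constants
    pad = window_size // 2
    mu_x = F.avg_pool2d(x, window_size, stride=1, padding=pad)
    mu_y = F.avg_pool2d(y, window_size, stride=1, padding=pad)
    # Local variances and covariance via E[z^2] - E[z]^2.
    var_x = F.avg_pool2d(x * x, window_size, stride=1, padding=pad) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, window_size, stride=1, padding=pad) - mu_y ** 2
    cov_xy = F.avg_pool2d(x * y, window_size, stride=1, padding=pad) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) / \
               ((mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))
    return ssim_map.mean()

def ms_ssim_loss(pred, target, window_sizes=(3, 7, 11)):
    """Average (1 - SSIM) over several window scales so that both fine
    and coarse structure of the distance transform map are penalized."""
    return sum(1.0 - ssim(pred, target, w) for w in window_sizes) / len(window_sizes)

# Usage: pred and target are predicted / ground-truth distance transform maps.
pred = torch.rand(2, 1, 64, 64, requires_grad=True)
target = torch.rand(2, 1, 64, 64)
loss = ms_ssim_loss(pred, target)
loss.backward()
```

Averaging over several window scales is what lets the loss reward agreement on both small, dense heads (small windows) and large, near-camera heads (large windows); the paper's adaptive variant goes further by choosing the window per annotated head.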