Abstract
CNN-based crowd counting methods have made great progress in recent years. However, most of them do not make full use of contextual information, which comprises high-level semantic features and low-level detail features from different receptive fields of the CNN. Rich contextual information is important for handling the scale variation problem in crowd counting, so the accuracy of these methods suffers. To address this problem, we propose an adaptive attention fusion mechanism (AAFM) that makes effective use of multi-scale features from different receptive fields of the CNN. It integrates a convolutional network for feature learning with an attention mechanism for multi-scale feature fusion. We use the first 13 convolution layers of VGG-16 as the encoder module to extract the base features, which are then fed into the decoder module. The decoder module mainly contains a Density Regression Branch (DRB) and a Feature Fusion Branch (FFB). The DRB uses multiple convolution layers for feature learning and multi-scale feature extraction. The FFB uses attention modules to model multi-scale features and element-wise multiplication to fuse them. AAFM can therefore bring rich contextual information into the encoder-decoder framework to generate high-quality crowd density maps and accurate counts. We perform experiments on the ShanghaiTech, UCF-CC-50, and UCF-QNRF datasets, and AAFM achieves promising results.
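The link between a density map and a count can be illustrated with a small sketch. It assumes the common density-based convention in which each annotated head contributes one unit-mass Gaussian, so the sum of the map gives the count. The kernel size, the fixed sigma, and the helper names `gaussian_kernel` and `density_map` are illustrative assumptions, not details taken from this paper.

```python
import numpy as np

def gaussian_kernel(size: int, sigma: float) -> np.ndarray:
    """Return a 2-D Gaussian kernel normalized to sum to 1."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def density_map(shape, heads, size=7, sigma=1.5):
    """Place one unit-mass Gaussian per head annotation (row, col).

    The canvas is padded by size // 2 on every side so that kernels
    near the image border keep their full mass (assumed convention).
    """
    h, w = shape
    pad = size // 2
    canvas = np.zeros((h + 2 * pad, w + 2 * pad))
    k = gaussian_kernel(size, sigma)
    for r, c in heads:
        # In padded coordinates the head sits at (r + pad, c + pad),
        # so the kernel window starts at (r, c).
        canvas[r:r + size, c:c + size] += k
    return canvas

# Hypothetical annotations for a 32 x 32 image.
heads = [(0, 0), (10, 20), (31, 31)]
density = density_map((32, 32), heads)

# The predicted or ground-truth count is the sum over the density map.
count = density.sum()
```

Here `count` equals the number of annotated heads up to floating-point error, which is why a network that regresses an accurate density map also yields an accurate count.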
Highlights
Crowd counting is a fundamental and key problem in the field of crowd analysis and scene understanding
To address scale variation in crowd counting, we propose an adaptive attention fusion mechanism (AAFM)
We present the results of AAFM on crowd counting and crowd localization
Summary
Crowd counting is a fundamental and key problem in the field of crowd analysis and scene understanding. The motivation of AAFM is to fuse the multi-scale features of neural networks and obtain rich contextual information. The decoder module mainly contains the Density Regression Branch (DRB) and the Feature Fusion Branch (FFB). It obtains rich contextual information through the feature learning of convolution layers and through multi-scale feature fusion. The DRB contains multiple 3 × 3 convolution layers and multiple upsampling layers. It retrieves the crowd density information by supervised learning. The FFB contains multiple attention modules, multiple 1 × 1 convolution layers, and upsampling layers. It obtains rich contextual information by multi-scale feature fusion. We propose an attentional fusion neural network (AAFM) for crowd counting. The FFB can fuse multi-scale features to obtain rich contextual information. It can model head regions effectively and alleviate counting errors caused by scale variation.
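The fusion step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the attention map is produced by a 1 × 1 convolution followed by a sigmoid, that it is upsampled with nearest-neighbour interpolation, and that it gates a finer-resolution feature map by element-wise multiplication. The function names, tensor shapes, and random weights are all hypothetical.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1 x 1 convolution: a per-pixel linear map over channels.

    x: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,)
    """
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) array."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_fuse(deep, shallow, weight, bias, factor=2):
    """Gate finer features with an attention map from coarser features.

    Assumed design: attention comes from the coarser (deeper) map.
    """
    attn = sigmoid(conv1x1(deep, weight, bias))  # (1, h, w), values in (0, 1)
    attn = upsample_nearest(attn, factor)        # (1, H, W)
    return shallow * attn                        # broadcast over channels

# Hypothetical feature maps at two scales.
rng = np.random.default_rng(0)
deep = rng.standard_normal((8, 4, 4))      # coarse, more channels
shallow = rng.standard_normal((4, 8, 8))   # fine, fewer channels
weight = rng.standard_normal((1, 8)) * 0.1
bias = np.zeros(1)

fused = attention_fuse(deep, shallow, weight, bias)
```

Because the attention values lie strictly between 0 and 1, the multiplication can only attenuate a feature response, which is one way such a gate could suppress background regions while keeping responses around heads.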