Abstract

CNN-based crowd counting methods commonly use image pyramids and dense connections to fuse features, addressing scale variation and information loss. However, these operations introduce information redundancy and blur the distinction between crowd and background. In this paper, we propose a multi-scale guided attention network (MGANet) to address these problems. Specifically, the multilayer features of the network are fused top-down to obtain multiscale and contextual information. An attention mechanism then guides the features of each layer along both the spatial and channel dimensions, so that the network focuses on the crowd in the image and ignores irrelevant information; the guided features are further integrated to produce the final high-quality density map. In addition, we propose a counting loss function that combines SSIM loss, MAE loss, and MSE loss to help the network converge effectively. We experiment on four major datasets and obtain good results. Ablation experiments demonstrate the effectiveness of the network modules. The source code is available at https://github.com/lpfworld/MGANet.
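As a rough illustration of how a loss combining SSIM, MAE, and MSE terms can be assembled, the following is a minimal sketch and not the authors' implementation. It makes several assumptions that do not come from the abstract: density maps are flattened to plain lists of floats, SSIM is computed once over the whole map (practical SSIM losses use local, usually Gaussian, windows), and the weights `w_ssim`, `w_mae`, and `w_mse` are hypothetical placeholders, since the actual weighting is not given here.

```python
def _mean(values):
    return sum(values) / len(values)


def global_ssim(x, y, data_range=1.0):
    """SSIM over a single global window (a simplification of windowed SSIM)."""
    # Standard stabilising constants from the SSIM definition.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = _mean(x), _mean(y)
    var_x = _mean([(a - mx) * (a - mx) for a in x])
    var_y = _mean([(b - my) * (b - my) for b in y])
    cov = _mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    numerator = (2 * mx * my + c1) * (2 * cov + c2)
    denominator = (mx * mx + my * my + c1) * (var_x + var_y + c2)
    return numerator / denominator


def combined_loss(pred, target, w_ssim=1.0, w_mae=1.0, w_mse=1.0):
    """Weighted sum of (1 - SSIM), MAE, and MSE; the weights are hypothetical."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    n = len(pred)
    mae = sum(abs(p - t) for p, t in zip(pred, target)) / n
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / n
    ssim_loss = 1.0 - global_ssim(pred, target)
    return w_ssim * ssim_loss + w_mae * mae + w_mse * mse
```

In this sketch the pixel-wise MAE and MSE terms penalise per-location density errors, while the SSIM term penalises structural disagreement between the predicted and ground-truth maps. A perfect prediction gives a loss of zero.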

Highlights

  • Crowd counting estimates the number of people in images or video frames, enabling effective management of scenes such as meetings and sports events

  • The method is evaluated using the mean absolute error (MAE) and the mean squared error (MSE)

  • MAE indicates the accuracy of the model's counting, and MSE represents the robustness of the model
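The two metrics named above can be computed from per-image counts as in the following minimal sketch. The paper's own formulae are not present in this extract, so this follows the conventional definitions rather than the authors' exact ones. Crowd counting papers often report the square root of the mean squared error under the name MSE; whether this paper does so is not visible here, so the hypothetical `root` flag exposes both variants.

```python
import math


def counting_errors(predicted, actual, root=True):
    """Return (mae, mse) for sequences of predicted and ground-truth counts.

    With root=True the second value is the square root of the mean squared
    error, the convention often used in crowd counting (an assumption here).
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [p - a for p, a in zip(predicted, actual)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    mse = sum(d * d for d in diffs) / len(diffs)
    if root:
        mse = math.sqrt(mse)
    return mae, mse
```

For example, `counting_errors([10, 20, 30], [12, 18, 30])` averages the absolute errors 2, 2, and 0 for the MAE and takes the root of their mean square for the second value.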


Summary

Introduction

Crowd counting estimates the number of people in images or video frames, enabling effective management of scenes such as meetings and sports events. It can also be used to count cells, viruses, and animals, extending the field of research into medical and behavioral science [6]. It remains a challenging task in computer vision because of crowd occlusion, scale variation, uneven data distribution, and so forth (see Figure 1). Other researchers used parallel convolution kernels to obtain feature maps at different scales, or fused multiscale information through dense connections of multilayer features [11, 12]. In such structures, the features learned by different branches are largely redundant and contribute little to the extraction of multiscale information. Some networks use spatial attention during training to emphasize the crowd in images and thereby mitigate problems such as background interference [13, 14].
