Abstract
State-of-the-art crowd counting methods use a fully convolutional network that extracts image features and then generates a crowd density map. However, this process often suffers from multiscale problems and a loss of contextual information. To address these problems, we propose a multiscale aggregation network (MANet) that includes a feature extraction encoder (FEE) and a density map decoder (DMD). The FEE uses a cascaded scale pyramid network to extract multiscale features and obtains contextual features through dense connections. The DMD uses deconvolution and fusion operations to generate features containing detailed information. These features are then converted into high-quality density maps from which the number of people in a crowd can be accurately calculated. An empirical comparison using four mainstream datasets (ShanghaiTech, WorldExpo'10, UCF_CC_50, and SmartCity) shows that the proposed method is more effective in terms of the mean absolute error and mean squared error. The source code is available at https://github.com/lpfworld/MANet.
Highlights
Crowd counting technology is widely used in video surveillance, crowd management, traffic control, and other fields as well as at sporting events and political meetings [1, 2].
We propose a multiscale aggregation network (MANet) for crowd counting (Figure 2). The proposed MANet is an encoder-decoder network that uses a densely connected multiscale aggregation module in the encoder, referred to as a cascade scale pyramid network (CSPN). The CSPN contains four parallel dilated convolutions with different dilation rates for capturing features of different receptive fields. The features obtained using the four dilated convolutions are further fused in a cascade manner to improve the network's ability to handle multiscale features and its resistance to interference.
Following the existing literature, we use the mean absolute error (MAE) and mean squared error (MSE) to evaluate the performance of crowd counting methods. The MAE indicates the accuracy of the count, and the MSE represents the robustness of the model. The MAE and MSE are calculated as follows: MAE
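The formulas themselves are cut off in this excerpt. As a hedged illustration, the sketch below uses the definitions that are common in the crowd counting literature, which may differ in notation from the paper's own: MAE is the mean of the absolute count errors over N test images, and the quantity reported as "MSE" is usually the square root of the mean squared count error. The function names and the sample counts are hypothetical and are not taken from the paper.

```python
import math


def mae(pred_counts, true_counts):
    """Mean absolute error between predicted and ground-truth counts."""
    n = len(true_counts)
    return sum(abs(p - t) for p, t in zip(pred_counts, true_counts)) / n


def mse(pred_counts, true_counts):
    """Root of the mean squared count error.

    Assumption: this follows the common crowd counting convention, in
    which the reported "MSE" includes the square root.
    """
    n = len(true_counts)
    squared = sum((p - t) ** 2 for p, t in zip(pred_counts, true_counts))
    return math.sqrt(squared / n)


# Hypothetical counts for three test images (illustrative values only).
predicted = [105.0, 48.0, 230.0]
ground_truth = [100.0, 50.0, 220.0]

print("MAE:", mae(predicted, ground_truth))
print("MSE:", mse(predicted, ground_truth))
```

Because the squared term penalizes large errors more heavily, a model with a few badly miscounted images has a noticeably higher MSE than MAE, which is why the MSE is read as a measure of robustness.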
Summary
Crowd counting technology is widely used in video surveillance, crowd management, traffic control, and other fields as well as at sporting events and political meetings [1, 2]. Crowd counting methods can be extended to indirectly related fields, such as medical image analysis and animal group behavioral analysis [3]. Multicolumn architectures involve several columns of a convolutional neural network (CNN) with different receptive fields to accommodate multiscale crowds [4, 5, 6, 7]. Although these methods have achieved good results, the multicolumn structure induces a considerable increase in parameters and computational costs. In addition, the similarity of the column networks results in a high redundancy of learned features [8, 9, 10]. The goal of our architecture is to retain more multiscale contextual features. The proposed network comprises an encoder that can extract and retain the required features and a decoder that gradually recovers the image resolution and interprets the encoded features.
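To make the role of the dilation rate concrete, the sketch below shows how a dilated convolution widens the span of a fixed-size kernel, which is how the four parallel branches of the CSPN can see different receptive fields. It is a minimal pure-Python illustration in one dimension, not the authors' implementation: the dilation rates 1 to 4, the kernel size 3, and the helper names are assumptions, since the excerpt does not state the actual values, and the cascade fusion step is not modeled.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by one dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation):
    """Valid-mode 1D dilated convolution (cross-correlation), stride 1."""
    span = (len(kernel) - 1) * dilation
    return [
        sum(kernel[j] * signal[i + j * dilation] for j in range(len(kernel)))
        for i in range(len(signal) - span)
    ]


# Hypothetical dilation rates for four parallel branches (not from the paper).
rates = [1, 2, 3, 4]
signal = list(range(10))
kernel = [1, 1, 1]  # 3-tap kernel, assumed for illustration

for rate in rates:
    out = dilated_conv1d(signal, kernel, rate)
    print(rate, effective_kernel_size(len(kernel), rate), out)
```

Each branch uses the same three weights, so the parameter count stays fixed while the covered span grows with the dilation rate. This contrasts with the multicolumn designs discussed above, which add a separate column for each scale.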