Abstract
Crowd counting is attracting growing attention because it can help prevent safety incidents in crowded places. However, scale variation and background clutter in images, such as buildings and trees, make it difficult to obtain an accurate count from an image. To address these problems, this work introduces a new multi-scale supervised network. The proposed model uses part of the VGG16 model as its backbone for feature extraction. During training, a multi-scale dilated convolution module is added at the end of each backbone stage to generate attention maps at different resolutions, which help the model focus on head regions in the feature maps. The dilated convolutions use three dilation rates to fit heads of different sizes in the image. Finally, to obtain a high-quality, high-resolution density map, the authors apply an upsampling operation that restores the density map to one quarter of the original image size. Extensive experiments on four datasets show that the proposed network substantially improves counting accuracy over many existing methods.
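
To illustrate why several dilation rates can fit heads of different sizes, the following minimal sketch computes how far a dilated kernel reaches and the padding that keeps the feature map size unchanged. The 3x3 kernel, the rates (1, 2, 3), and all function names are assumptions made for illustration; the abstract states only that three dilation rates are used.

```python
def effective_kernel(kernel_size: int, dilation: int) -> int:
    """Span, in pixels, that a dilated kernel covers along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def same_padding(kernel_size: int, dilation: int) -> int:
    """Padding that preserves spatial size at stride 1 (odd kernel sizes)."""
    return dilation * (kernel_size - 1) // 2


def conv_output_size(size: int, kernel_size: int, dilation: int,
                     padding: int, stride: int = 1) -> int:
    """Standard output-size formula for a (dilated) convolution."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


# Hypothetical dilation rates: not taken from the paper.
ASSUMED_DILATIONS = (1, 2, 3)


def branch_summary(feature_size: int, kernel_size: int = 3,
                   dilations=ASSUMED_DILATIONS):
    """Describe each parallel branch of a multi-scale dilated module."""
    rows = []
    for d in dilations:
        pad = same_padding(kernel_size, d)
        rows.append({
            "dilation": d,
            "effective_kernel": effective_kernel(kernel_size, d),
            "padding": pad,
            "output_size": conv_output_size(feature_size, kernel_size, d, pad),
        })
    return rows


if __name__ == "__main__":
    for row in branch_summary(feature_size=64):
        print(row)
```

Under these assumptions each branch keeps the same output size, so the branch outputs can be combined directly, while the effective kernel grows with the dilation rate without adding parameters.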