Abstract
With the end of the COVID-19 pandemic, the number of pedestrians in public places has increased dramatically. Accurately estimating the size and density distribution of crowds from images is essential for public safety. Several factors still limit the accuracy of dense crowd counting, including perspective distortion, background clutter, and heavy occlusion. To estimate crowd size accurately in RGB images, we propose CrowdUNet, a crowd counting network assisted by a segmentation task. It applies the segmentation results to the crowd counts, so that the network focuses on predicting foreground regions. We combine the Swin Transformer block with a CNN to build a Swin Transformer Convolution (STC) module that extracts deep semantic features. We analyze the characteristics of crowd images and propose a novel decoder structure, the Coordinate Decoder (CD), which better aggregates low-level and high-level features and improves the robustness of the network. To obtain accurate regression results, we also propose a regression head with multi-scale receptive fields, called Spatial Pyramid Convolution (SPC). Extensive experiments on four challenging crowd counting datasets, namely ShanghaiTech A, ShanghaiTech B, UCF_CC_50, and UCF-QNRF, validate the proposed method.
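The idea of applying segmentation results to the crowd count can be illustrated with a minimal sketch. The abstract does not specify the mechanism, so the version below assumes one simple possibility: a predicted foreground probability map is thresholded and used to gate a predicted density map before summation. The function name `masked_count`, the threshold of 0.5, and the toy values are hypothetical and are not taken from CrowdUNet.

```python
from typing import List

Grid = List[List[float]]


def masked_count(density: Grid, seg_prob: Grid, threshold: float = 0.5) -> float:
    """Sum density values only where the segmentation map marks foreground.

    Assumption: gating by a thresholded mask is one plausible way to apply
    segmentation results to a count; the paper may use a different scheme.
    """
    if len(density) != len(seg_prob) or any(
        len(d_row) != len(s_row) for d_row, s_row in zip(density, seg_prob)
    ):
        raise ValueError("density and seg_prob must have the same shape")

    total = 0.0
    for d_row, s_row in zip(density, seg_prob):
        for d, s in zip(d_row, s_row):
            # Background pixels (low foreground probability) contribute nothing.
            if s >= threshold:
                total += d
    return total


# Toy 2x2 maps: two pixels are foreground, two are background clutter.
density = [[0.5, 0.25], [0.125, 1.0]]
seg_prob = [[0.9, 0.1], [0.2, 0.8]]
count = masked_count(density, seg_prob)
```

In this toy case the density predicted on the two background pixels is discarded, so clutter outside the foreground does not inflate the count. A trained network could equally use a soft mask (multiplying by the probability) rather than a hard threshold.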