With the end of the COVID-19 pandemic, the number of pedestrians in public places has increased dramatically. Accurately estimating the size and density distribution of crowds from images is essential for public safety. Many factors still limit the accuracy of dense crowd counting, such as perspective distortion, background clutter, and heavy occlusion. To accurately estimate crowd size in RGB images, we propose a crowd counting network called CrowdUNet, which is assisted by a segmentation task. It applies the segmentation results to the crowd counting branch, making the network focus on predicting foreground regions. We combine Swin Transformer blocks and CNNs to build a Swin Transformer Convolution (STC) module that extracts deep semantic features. We analyze the characteristics of crowd images and propose a novel decoder structure called the Coordinate Decoder (CD), which better aggregates low- and high-level features and improves the robustness of the network. To obtain accurate regression results, we also propose a regression head with multi-scale receptive fields, called Spatial Pyramid Convolution (SPC). Extensive experiments on four challenging crowd counting datasets, namely ShanghaiTech A, ShanghaiTech B, UCF-CC 50, and UCF-QNRF, validate the proposed method.
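The abstract does not give implementation details, but a regression head with multi-scale receptive fields is commonly built from parallel dilated convolutions. The sketch below is a hypothetical PyTorch illustration of that idea (the class name, channel widths, and dilation rates are assumptions, not the paper's actual SPC design):

```python
import torch
import torch.nn as nn

class SpatialPyramidConv(nn.Module):
    """Illustrative multi-scale regression head (not the paper's exact SPC):
    parallel 3x3 branches with different dilation rates capture receptive
    fields of several sizes, then a 1x1 conv fuses them into a one-channel
    density map."""

    def __init__(self, in_ch=256, branch_ch=64, dilations=(1, 2, 3)):
        super().__init__()
        # padding == dilation keeps the spatial size unchanged for a 3x3 kernel
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, 3, padding=d, dilation=d),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        )
        self.fuse = nn.Conv2d(branch_ch * len(dilations), 1, kernel_size=1)

    def forward(self, x):
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        return self.fuse(feats)  # predicted density map

head = SpatialPyramidConv()
density = head(torch.randn(1, 256, 32, 32))
count = density.sum().item()  # crowd count is the integral of the density map
print(density.shape)  # torch.Size([1, 1, 32, 32])
```

Summing the predicted density map yields the estimated crowd count, which is the standard density-map formulation this family of counting networks uses.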