Abstract

Huge variations in the scale of people in images make crowd counting an extremely challenging task. Many researchers currently apply multi-column structures to address the scale variation problem. However, multi-column networks tend to be complex, carry large numbers of parameters, and are difficult to optimize. To this end, we propose a scale-aware representation learning network (SRNet) built on the widely used encoder-decoder framework. In the encoder, an image is converted into deep features by the first ten layers of VGG16. The decoder then regresses these features to a crowd density map. The decoder mainly consists of two modules: the scale-aware feature learning module (SAM) and the pixel-aware upsampling module (PAM). SAM models the multi-scale features of a crowd at each level with different sizes of receptive fields, and PAM enlarges the spatial resolution and enhances the pixel-level semantic information, thereby improving the overall counting accuracy. We conduct extensive crowd counting experiments on the ShanghaiTech Part_A, UCF-QNRF, and UCF_CC_50 datasets. Furthermore, to obtain the location of each person, we conduct crowd localization experiments on the UCF-QNRF and NWPU-Crowd datasets. The qualitative and quantitative results demonstrate the effectiveness of SRNet in dense crowd counting and crowd localization tasks.
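
The encoder described in the abstract can be made more concrete with a small receptive-field calculation. The sketch below is illustrative only: it assumes that "the first ten layers of VGG16" means the ten 3x3 convolutions up to conv4_3 together with the three interleaved 2x2 max-pooling layers, which is the usual VGG16 layout but is not spelled out here. The names `VGG16_FRONT_END` and `receptive_field` are hypothetical helpers, not part of SRNet.

```python
# Each entry is (kernel_size, stride).
# ASSUMPTION: the encoder is the standard VGG16 front end up to conv4_3,
# i.e. ten 3x3 convolutions with three 2x2 max-pooling layers in between.
VGG16_FRONT_END = [
    (3, 1), (3, 1), (2, 2),          # conv1_1, conv1_2, pool1
    (3, 1), (3, 1), (2, 2),          # conv2_1, conv2_2, pool2
    (3, 1), (3, 1), (3, 1), (2, 2),  # conv3_1 to conv3_3, pool3
    (3, 1), (3, 1), (3, 1),          # conv4_1 to conv4_3
]


def receptive_field(layers):
    """Return (receptive_field, output_stride) for a chain of conv/pool layers."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf, jump


rf, stride = receptive_field(VGG16_FRONT_END)
print(rf, stride)  # 92 8
```

Under this assumption, every encoder output location sees one fixed 92x92 window at one eighth of the input resolution. That is consistent with the roles the abstract assigns to the decoder: SAM varies the receptive field size, and PAM enlarges the spatial resolution.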

Highlights

  • Crowd counting is a classic computer vision task that aims to automatically count the number of people in a given image.

  • Unlike previous methods, we propose a scale-aware representation learning network and construct a scale-aware feature learning module that transfers and refines the extracted feature information between layers while avoiding the large increase in parameters that comes with growing network complexity.

  • Scale-aware representation learning network: we present a schematic diagram of the model in Fig. 1 and call it the scale-aware representation learning network (SRNet), which is used for dense crowd counting.


Summary

INTRODUCTION

Crowd counting is a classic computer vision task that aims to automatically count the number of people in a given image. Other methods [5], [6] have been proposed to improve the accuracy by manually extracting features and using a regressor to directly regress the number of people. Most networks, such as MCNN [7], Switching-CNN [8], CAN [10], and CP-CNN [16], aim to model multi-scale variations in crowds. Some of these methods have redundant features [7], require complex structures to train multiple regressors [8], [16], or include a large number of parameters [10]. SAM models the multi-scale features of the crowd at each level with different sizes of receptive fields, and PAM improves the resolution and enhances the pixel-level semantic information, thereby improving the overall counting accuracy.
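
To illustrate how a density map differs from regressing the count directly, the following minimal sketch builds a density map from point annotations and reads the count back as the sum over all pixels. It follows the common practice of placing one normalized Gaussian per annotated head. The kernel width `sigma`, the image size, and the helper names `density_map` and `count` are assumptions for illustration, not the settings used for SRNet.

```python
import math


def density_map(points, height, width, sigma=2.0):
    """Place one Gaussian with unit mass at each annotated (row, col) position."""
    dmap = [[0.0] * width for _ in range(height)]
    for py, px in points:
        weights = [
            [math.exp(-((y - py) ** 2 + (x - px) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)
        ]
        total = sum(map(sum, weights))
        # Normalize inside the image so that each person contributes exactly 1.
        for y in range(height):
            for x in range(width):
                dmap[y][x] += weights[y][x] / total
    return dmap


def count(dmap):
    """The estimated number of people is the sum over the density map."""
    return sum(map(sum, dmap))


heads = [(5, 5), (10, 20), (25, 12)]  # hypothetical head annotations
dmap = density_map(heads, height=32, width=32)
print(round(count(dmap), 6))  # 3.0
```

Unlike a single regressed number, the map also keeps the spatial distribution of the crowd.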

RELATED WORK
MULTI-SCALE FEATURE LEARNING FOR OTHER COMPUTER VISION TASKS
PROPOSED METHOD
PIXEL-AWARE UPSAMPLING MODULE
LOCALIZATION TASK
LOSS FUNCTION: The L2 loss is chosen as the loss function for the crowd counting task.
EXPERIMENTS
IMPLEMENTATION DETAILS
Method
CONCLUSION
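
The LOSS FUNCTION entry in the outline above states that the L2 loss is used for the counting task. A minimal sketch of a pixel-wise L2 loss between predicted and ground-truth density maps is given below. The 1/(2N) normalization over a batch of N images is a common convention in density-map counting and is an assumption here, as is the helper name `l2_loss`; the exact form used for SRNet is not given in this summary.

```python
def l2_loss(predicted, target):
    """Pixel-wise L2 loss over a batch of density maps (nested lists).

    ASSUMPTION: squared errors are summed per image and scaled by 1 / (2N),
    where N is the batch size.
    """
    if len(predicted) != len(target):
        raise ValueError("predicted and target must contain the same number of maps")
    total = 0.0
    for pred_map, gt_map in zip(predicted, target):
        for pred_row, gt_row in zip(pred_map, gt_map):
            for p, g in zip(pred_row, gt_row):
                total += (p - g) ** 2
    return total / (2 * len(predicted))


# One 1x2 "density map": the squared error is (2 - 0)^2 = 4, halved by 1 / (2N).
print(l2_loss([[[1.0, 2.0]]], [[[1.0, 0.0]]]))  # 2.0
```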