Abstract

Crowd counting has become an important vision task because of its many practical applications, but it remains challenging. State-of-the-art methods generally estimate the density map of a crowd image from the high-level semantic features of deep convolutional networks. However, the absence of low-level spatial information can cause counting errors in the local details of the density map. To address this, we propose a novel framework for single-image crowd counting, the Multi-level Feature Fusion Network (MFFN). MFFN is built in an encoder–decoder fashion and combines semantic and spatial information to generate high-resolution density maps of input crowd images. Skip connections between the encoder and the decoder allow low-level spatial information and high-level semantic features to be combined by element-wise addition. In addition, a dense dilated convolution block placed after the encoder extracts multi-scale context features, which guide feature fusion through a channel attention mechanism. The model is trained with multi-task learning, in which semantic segmentation supervision is introduced to enhance feature representation. Extensive experiments on three crowd counting datasets (ShanghaiTech, UCF_CC_50, UCF-QNRF) show that MFFN outperforms state-of-the-art methods. Ablation studies are also performed to verify the effectiveness of each component of the proposed method.
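
The fusion step described in the abstract (context features producing channel attention, followed by element-wise addition of low-level and high-level features) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the nearest-neighbour upsampling, the choice to apply the gate to the low-level features, and the parameter-free gate (sigmoid of a global average pool, with no learned layers) are all assumptions made for illustration. Feature maps are plain nested lists of shape C x H x W.

```python
import math

def global_avg_pool(feat):
    """Average each channel of a C x H x W nested list down to one scalar."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]

def channel_attention(context):
    """Per-channel gate in (0, 1): sigmoid of the pooled context response.
    Assumption: a real attention module would add learned layers here."""
    return [1.0 / (1.0 + math.exp(-v)) for v in global_avg_pool(context)]

def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of each channel by an integer factor."""
    return [
        [[v for v in row for _ in range(factor)]
         for row in ch for _ in range(factor)]
        for ch in feat
    ]

def fuse(low, high, context, factor=2):
    """Gate the low-level features channel-wise, then add the upsampled
    high-level features element-wise (hypothetical fusion rule)."""
    gates = channel_attention(context)
    high_up = upsample_nearest(high, factor)
    return [
        [[g * l + h for l, h in zip(lrow, hrow)]
         for lrow, hrow in zip(lch, hch)]
        for g, lch, hch in zip(gates, low, high_up)
    ]

# Toy inputs: 2 channels; low-level maps are 2 x 2, high-level maps are 1 x 1.
low = [[[2.0, 2.0], [2.0, 2.0]], [[4.0, 4.0], [4.0, 4.0]]]
high = [[[3.0]], [[1.0]]]
context = [[[0.0]], [[0.0]]]  # zero context gives a gate of exactly 0.5

fused = fuse(low, high, context)
print(fused[0][0][0], fused[1][0][0])  # prints 4.0 3.0
```

With a zero context response each gate is 0.5, so the first channel becomes 0.5 * 2.0 + 3.0 = 4.0 and the second 0.5 * 4.0 + 1.0 = 3.0. The fused map keeps the spatial resolution of the low-level input, which is the property the skip connections are meant to preserve.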
