A Multi-Scale Feature Fusion Network With Cascaded Supervision for Cross-Scene Crowd Counting

Xinfeng Zhang, Xiaohu Wang, Wencong Shan, Congcong Zhu, Bin Li, Shuhan Chen, Lina Han

doi:10.1109/tim.2023.3246534

Abstract

Counting the number of people in public places has received much attention, and researchers have devoted much effort to the task. However, existing crowd counting approaches are mainly trained and tested in similar scenarios, and their performance degrades sharply when the test scenes differ in type from the training scenes. In practice, crowd scenes are highly variable, and the lack of cross-scene capability can seriously limit the application of existing approaches. We argue that improving cross-scene crowd counting capability requires accommodating large changes in the scale of individuals and suppressing the interference of cluttered backgrounds. To this end, we propose a multi-scale feature fusion network (MFFNet) with cascaded supervision. The multi-scale features extracted from the crowd images are upsampled and then combined into several feature blocks; convolution and deconvolution operations on these feature blocks then yield feature matrices of different resolutions. The feature matrices are fused from bottom to top. During feature fusion, the crowd density maps corresponding to the feature matrices of different resolutions are predicted separately. We devise cascaded supervision to synchronously optimize the density map predictions at the different resolutions during training. The cross-scene crowd counting experiments are conducted on four types of scenes: SHT A, with high-density crowd scenes and small-scale individuals; SHT B, with sparse crowd distribution and medium-scale individuals; the UCF_CC_50 dataset, with extremely dense scenes and tiny-scale individuals; and the UCF-QNRF dataset, with extreme variations. MFFNet exhibits the strongest scene adaptability relative to the state-of-the-art approaches, with an average decrease of 17.1% and 8.4% in MAE and MSE, respectively. The contributions of the different components of our method are verified in the ablation study using the devised evaluation metrics. Our implementation will be available at https://github.com/learnsharing/MFFNet.
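
The idea of supervising a density map at each resolution, and of scoring a model by MAE and MSE on counts, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the loss, so the pixel-wise squared-error loss, the count-preserving sum pooling used to build lower-resolution targets, the per-scale weights, and all function names here are assumptions made for illustration. The count metric labeled MSE follows the convention common in crowd counting work (the root of the mean squared count error); whether this paper uses exactly that definition is also an assumption.

```python
import math


def sum_pool(density, factor):
    """Downsample a 2-D density map by summing factor x factor blocks.

    Sum pooling (an assumption of this sketch) keeps the total count
    unchanged. Height and width must be divisible by `factor`.
    """
    h, w = len(density), len(density[0])
    return [
        [
            sum(density[y + dy][x + dx]
                for dy in range(factor) for dx in range(factor))
            for x in range(0, w, factor)
        ]
        for y in range(0, h, factor)
    ]


def pixel_mse(pred, target):
    """Mean squared error over all pixels of two equally sized maps."""
    n = len(pred) * len(pred[0])
    return sum((p - t) ** 2
               for pred_row, target_row in zip(pred, target)
               for p, t in zip(pred_row, target_row)) / n


def cascaded_loss(preds, gt_density, weights=None):
    """Weighted sum of per-resolution losses (hypothetical formulation).

    `preds` holds one predicted density map per resolution; each map's
    height must divide the height of `gt_density`.
    """
    weights = weights or [1.0] * len(preds)
    total = 0.0
    for pred, weight in zip(preds, weights):
        factor = len(gt_density) // len(pred)
        target = sum_pool(gt_density, factor)  # factor 1 returns a copy
        total += weight * pixel_mse(pred, target)
    return total


def count_errors(pred_counts, gt_counts):
    """Return (MAE, MSE) over per-image counts; MSE here is the root of
    the mean squared error, as is common in crowd counting."""
    n = len(gt_counts)
    mae = sum(abs(p - g) for p, g in zip(pred_counts, gt_counts)) / n
    mse = math.sqrt(
        sum((p - g) ** 2 for p, g in zip(pred_counts, gt_counts)) / n)
    return mae, mse
```

Because every resolution contributes its own term to one scalar loss, a single backward pass through such a loss would update all prediction branches together, which is one plausible reading of optimizing them synchronously.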

Full Text