Crowd counting is important in many application areas. Existing methods have poor accuracy in scenes with strong perspective and in low-illumination scenes. In addition, existing audio-assisted methods use only local audio, which does not capture the spatial features of sound arriving from all directions. To alleviate these problems, a novel framework named Video and Audio-assisted Crowd Counting Network (VACCNet) is proposed. The framework consists of two submodules: a Video Crowd Counting (VCC) module and an Audio-assisted Crowd Counting (ACC) module. The visual features from the VCC module and the fused audio features from the ACC module are combined to produce the final density map. To evaluate the effectiveness of VACCNet, a new self-collected dataset named multiPle dIrection Assistance couNting netwOrk (PIANO) is built. Experimental results on existing benchmarks and on PIANO show that the proposed method achieves an average improvement of 14.23% over conventional methods.
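The combination step described above can be illustrated with a minimal sketch. The abstract does not specify how VACCNet fuses the two feature streams, so everything here is an assumption made for illustration: the function name `fuse_and_count`, the tensor shapes, broadcasting the audio vector over the image grid, channel concatenation, and a 1x1 projection with ReLU are hypothetical and are not taken from the paper. The only standard fact used is that a crowd count is the sum of a density map.

```python
import numpy as np


def fuse_and_count(visual_feat, audio_feat, weights):
    """Hypothetical fusion of visual and audio features into a density map.

    visual_feat: array of shape (Cv, H, W), assumed output of a visual branch.
    audio_feat:  array of shape (Ca,), assumed fused audio feature vector.
    weights:     array of shape (Cv + Ca,), assumed 1x1 projection weights.
    Returns the density map of shape (H, W) and the estimated count.
    """
    _, h, w = visual_feat.shape
    ca = audio_feat.shape[0]

    # Assumption: tile the audio vector over every spatial location.
    audio_map = np.broadcast_to(audio_feat[:, None, None], (ca, h, w))

    # Assumption: combine the two streams by channel concatenation.
    fused = np.concatenate([visual_feat, audio_map], axis=0)

    # 1x1 projection over channels, then ReLU so the density is non-negative.
    density = np.maximum(np.tensordot(weights, fused, axes=([0], [0])), 0.0)

    # The estimated count is the sum of the density map.
    return density, float(density.sum())


visual = np.ones((2, 3, 4))            # toy visual features
audio = np.array([1.0])                # toy audio feature
proj = np.array([0.5, 0.25, 0.25])     # toy projection weights
density_map, count = fuse_and_count(visual, audio, proj)
```

With these toy inputs every pixel of the density map equals 1.0, so the estimated count equals the number of pixels in the 3 by 4 grid.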