Abstract

Video-based crowd counting and density estimation (CCDE) is vital for crowd monitoring. Existing solutions fall short in addressing issues such as cluttered backgrounds and scale variation in crowd videos. To this end, a multiscale head attention-guided multiscale density map fusion for video-based CCDE via a multi-attention spatial-temporal CNN (MHAMD-MST-CNN) is proposed. The MHAMD-MST-CNN has three modules: a multi-attention spatial stream (MASS), a multi-attention temporal stream (MATS), and a final density map generation (FDMG) module. The spatial head attention modules (SHAMs) and temporal head attention modules (THAMs) are designed to eliminate background influence from the MASS and the MATS, respectively, by mapping the multiscale spatial or temporal features to head maps. The multiscale de-backgrounded features are utilised by the density map generation (DMG) modules to generate multiscale density maps, which handle the scale variation caused by perspective distortion. The multiscale density maps are fused and fed into the FDMG module to obtain the final crowd density map. The MHAMD-MST-CNN has been trained and validated on three publicly available benchmark datasets: Venice, Mall, and UCSD. It provides competitive results compared with state-of-the-art methods in terms of mean absolute error (MAE) and root mean squared error (RMSE).
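The abstract describes the architecture only at a high level. The following minimal PyTorch sketch illustrates the data flow it outlines: two multiscale streams whose features are gated by head-attention maps, per-scale density map generation, and fusion into a final density map. All layer widths, kernel sizes, the number of scales, the two-channel motion input (e.g. optical flow), and the 1x1-convolution fusion are illustrative assumptions, not the paper's actual configuration.

```python
# Hypothetical sketch of the MHAMD-MST-CNN pipeline; module structure,
# channel widths, and inputs are assumed for illustration only.
import torch
import torch.nn as nn


class HeadAttentionModule(nn.Module):
    """Maps features to a head map that suppresses background (SHAM/THAM role)."""
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),  # per-pixel head probability
        )

    def forward(self, feats):
        # De-backgrounded features: the head map gates the input features.
        return feats * self.attn(feats)


class DMG(nn.Module):
    """Density map generation from de-backgrounded features at one scale."""
    def __init__(self, channels):
        super().__init__()
        self.head = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feats):
        return self.head(feats)


class Stream(nn.Module):
    """One multiscale stream (spatial or temporal) with attention + DMG per scale."""
    def __init__(self, in_ch, widths=(32, 64, 128)):  # widths are assumed
        super().__init__()
        self.blocks = nn.ModuleList()
        c = in_ch
        for w in widths:
            self.blocks.append(nn.Sequential(
                nn.Conv2d(c, w, 3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            ))
            c = w
        self.attn = nn.ModuleList(HeadAttentionModule(w) for w in widths)
        self.dmg = nn.ModuleList(DMG(w) for w in widths)

    def forward(self, x):
        maps = []
        for block, attn, dmg in zip(self.blocks, self.attn, self.dmg):
            x = block(x)
            maps.append(dmg(attn(x)))
        # Resize all per-scale density maps to a common size before fusion.
        size = maps[0].shape[-2:]
        maps = [nn.functional.interpolate(m, size=size, mode='bilinear',
                                          align_corners=False) for m in maps]
        return torch.cat(maps, dim=1)


class MHAMD_MST_CNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.mass = Stream(in_ch=3)   # spatial stream: current RGB frame
        self.mats = Stream(in_ch=2)   # temporal stream: assumed 2-channel motion input
        self.fdmg = nn.Conv2d(6, 1, kernel_size=1)  # fuse 2 streams x 3 scales

    def forward(self, frame, motion):
        fused = torch.cat([self.mass(frame), self.mats(motion)], dim=1)
        return self.fdmg(fused)  # final crowd density map


model = MHAMD_MST_CNN()
frame = torch.randn(1, 3, 128, 128)
motion = torch.randn(1, 2, 128, 128)
density = model(frame, motion)
print(density.shape, density.sum().item())  # crowd count is the density sum
```

Summing the final density map yields the estimated crowd count, which is how MAE and RMSE are typically computed against ground-truth counts.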
