Abstract

Crowd behavior prediction (CBP) and crowd counting (CC) are essential functions of vision-based crowd analysis (CA), which plays a crucial role in preventing crowd disasters. Using separate models for CBP and CC increases computational overhead and introduces synchronization issues. State-of-the-art approaches employ deep convolutional architectures to exploit spatial-temporal features, but such models suffer from high computational complexity during convolution operations. To address these issues, this paper develops a single deep model that performs both CA functionalities: CBP and CC. The proposed model uses multiple layers of depth-wise separable CNN (DSCNN) to extract fine-grained spatial-temporal features from the scene. The DSCNN reduces the number of multiplications per convolution operation compared to a traditional CNN. Furthermore, existing datasets support only a single CA functionality, whereas the proposed model requires a dual-task CA dataset providing ground-truth labels for both CBP and CC. Thus, a dual-functionality CA dataset is prepared from a benchmark crowd behavior dataset, i.e. MED: around 41,000 frames have been manually annotated to obtain ground-truth crowd count values. Experiments on the proposed multi-functional dataset show that the model outperforms state-of-the-art methods on several performance metrics. In addition, the proposed model processes each test frame in 3.40 milliseconds and is thus suitable for real-time application.
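The computational saving of depth-wise separable convolution can be made concrete by counting multiplications. A standard convolution applies C_out filters of size K×K×C_in at every spatial position, whereas a depth-wise separable convolution factors this into a per-channel K×K depth-wise step followed by a 1×1 point-wise step. The sketch below counts multiplications for one layer under illustrative dimensions (stride 1, "same" padding); the sizes are assumptions for illustration, not values from the paper:

```python
# Multiplication counts for one convolutional layer on an H x W feature map,
# stride 1, "same" padding. Example dimensions are illustrative assumptions.

def standard_conv_mults(h, w, c_in, c_out, k):
    # Each output position needs k*k*c_in multiplies per output channel.
    return h * w * c_out * k * k * c_in

def depthwise_separable_mults(h, w, c_in, c_out, k):
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1x1 convolution mixes channels
    return depthwise + pointwise

h = w = 56
c_in, c_out, k = 64, 128, 3
std = standard_conv_mults(h, w, c_in, c_out, k)
dsc = depthwise_separable_mults(h, w, c_in, c_out, k)
print(f"standard: {std:,}  separable: {dsc:,}  ratio: {dsc / std:.3f}")
# The ratio simplifies to 1/c_out + 1/k**2 (about 0.12 here, an ~8x saving).
```

For the 3×3 kernels typical of such architectures, the cost ratio 1/C_out + 1/K² is dominated by the 1/K² = 1/9 term, which is the source of the roughly order-of-magnitude reduction that makes real-time per-frame processing plausible.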
