Abstract. Crowd counting, a critical component of management and safety planning for large gatherings and public spaces, is essential for ensuring smooth event operations and preventing overcrowding. While standard convolutional neural network (CNN) based models perform well on head counting tasks, they have certain drawbacks in complex scenarios. With the rapid development of artificial intelligence, Transformer models that rely on self-attention mechanisms, such as the Swin Transformer, have recently demonstrated exceptional performance on visual tasks such as image classification and segmentation. This study examines experimental results of the Swin Transformer on head counting tasks and contrasts them with those of a CNN-based model. Evaluation with Mean Absolute Error (MAE) and Mean Squared Error (MSE) shows that the Transformer model outperforms the classic CNN model in generalization ability when dealing with complicated scenarios. Future work will increase the diversity of datasets, optimize the model structure, and improve training efficiency.
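MAE and MSE are the standard evaluation metrics in crowd counting. A minimal sketch of how they are typically computed over per-image head counts (the counts below are hypothetical, not the paper's data; note that in the crowd counting literature "MSE" conventionally denotes the root of the mean squared error):

```python
import math

def mae(pred, true):
    # Mean Absolute Error: average absolute count deviation per image
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def mse(pred, true):
    # Crowd counting convention: "MSE" is the square root of the
    # mean squared error (i.e., RMSE)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))

# Hypothetical predicted vs. ground-truth head counts for four images
pred = [102, 87, 250, 33]
true = [100, 90, 240, 30]
print(mae(pred, true))  # 4.5
print(mse(pred, true))
```

Lower values of both metrics indicate more accurate counts; MSE penalizes large per-image errors more heavily than MAE.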