Multi-level learning counting via pyramid vision transformer and CNN

Jiayu Liu, He Li, Weihang Kong

doi:10.1016/j.engappai.2023.106184

Abstract

Severe scale variation remains a major obstacle to improving accuracy in crowd counting. To tackle this problem, we propose the Pyramid Transformer CNN Network (PTCNet), an effective combination of a transformer and a CNN that possesses both global receptive fields and locality, allowing it to handle severe scale variation and boost prediction accuracy. First, we use the pyramid vision transformer to extract multi-level global context information of the crowd, targeting different head scales. Then, the multi-level information is fully fused in the multi-level feature aggregation module, where detailed crowd characteristics from all feature spaces are preserved for further processing. Finally, we design a multi-branch regression head to enrich the crowd features for strong representations and regress the density maps. Extensive experiments on challenging datasets with complex scenarios and multiple scales demonstrate the effectiveness of our method. The proposed method achieves competitive results compared with state-of-the-art approaches and achieves state-of-the-art results (MAE: 51.7, RMSE: 79.6) on the ShanghaiTech Part_A dataset.
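
For readers unfamiliar with the reported metrics, the following is a minimal sketch of how MAE and RMSE are conventionally computed in density-map-based crowd counting: the predicted count for an image is the sum of its density map, and errors are taken over per-image counts. The function names and the toy data are hypothetical and are not taken from the paper; this is not the authors' evaluation code.

```python
import math


def count_from_density(density_map):
    """Estimated head count: the sum of all values in a 2-D density map."""
    return sum(sum(row) for row in density_map)


def mae_rmse(pred_counts, gt_counts):
    """Mean absolute error and root mean squared error over per-image counts."""
    if not pred_counts or len(pred_counts) != len(gt_counts):
        raise ValueError("need two non-empty sequences of equal length")
    errors = [p - g for p, g in zip(pred_counts, gt_counts)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


# Toy data (hypothetical): two tiny "density maps" and their ground-truth counts.
pred_maps = [
    [[0.5, 0.5], [1.0, 0.0]],  # sums to 2.0
    [[1.0, 1.0], [1.0, 2.0]],  # sums to 5.0
]
gt_counts = [3, 4]

pred_counts = [count_from_density(m) for m in pred_maps]
mae, rmse = mae_rmse(pred_counts, gt_counts)
print(f"MAE: {mae:.1f}, RMSE: {rmse:.1f}")
```

Because RMSE squares each per-image error before averaging, it penalizes large miscounts on individual images more heavily than MAE does, which is why the two are usually reported together.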

Full Text