Abstract

Weakly-supervised crowd counting methods that use only count-level labels have made great progress recently. However, a count-level label cannot provide information about the distribution of the crowd in the scene, which affects the accuracy of the final count. We therefore design a density token to perceive the crowd distribution in a scene. Based on this, we propose a Dual Supervision Transformer (DSFormer) that performs weakly-supervised crowd counting under dual supervision of the total count. Specifically, the features encoded by the vision transformer are sent to the proposed locality enhanced module (LEM), and one branch of these features, together with the density tokens, is fed into the decoder for interaction. Crowd distribution perception is then realized through cross-attention, in which the other branch of the encoded features serves as the queries. Finally, the output features of the decoder are fed into a count regression head and a crowd density classification head to obtain the crowd count and the crowd density class, respectively. Experiments on three commonly used crowd counting datasets demonstrate the effectiveness of DSFormer through both quantitative and visualization results. Our model achieves superior results among weakly-supervised crowd counting methods, and code is available at: https://github.com/ZaiyiHu/DSFormer.
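The abstract describes a decoder in which learnable density tokens interact with encoded features via cross-attention, followed by two heads: count regression and density classification. The following is a minimal, hypothetical numpy sketch of that dataflow, not the authors' implementation: the shapes, the number of density levels, and the random head weights are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, dim):
    # Queries come from one branch of the encoded features;
    # keys/values carry the other branch plus the density tokens.
    scores = softmax(queries @ keys_values.T / np.sqrt(dim))
    return scores @ keys_values

rng = np.random.default_rng(0)
dim = 16                                   # assumed feature dimension
patch_feats = rng.normal(size=(8, dim))    # encoded patch features (one branch)
density_tokens = rng.normal(size=(4, dim)) # assumed 4 density levels, learnable in practice

# Decoder interaction: density tokens joined with the encoded features.
decoder_input = np.concatenate([patch_feats, density_tokens], axis=0)
decoded = cross_attention(patch_feats, decoder_input, dim)

# Dual supervision heads (random weights stand in for learned ones).
pooled = decoded.mean(axis=0)
w_count = rng.normal(size=(dim, 1))
crowd_count = float(pooled @ w_count)          # count regression head
w_cls = rng.normal(size=(dim, 4))
density_probs = softmax(pooled @ w_cls)        # density classification head
```

The sketch only illustrates how a single scalar count and a density-class distribution can be supervised jointly from the same decoded features, which is the "dual supervision" idea the abstract names.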
