Abstract

Recently, crowd counting has attracted significant attention, particularly in the context of the COVID-19 pandemic, due to its ability to automatically provide accurate crowd counts from images. To avoid the high cost of location-level labeling, several transformer-based crowd counting methods have been proposed that require only count-level supervision. However, these methods use the transformer directly as an encoder without accounting for uneven crowd distributions. To address this issue, we propose CCTwins, a novel transformer-based crowd counting method that likewise uses only count-level supervision. Specifically, we introduce an adaptive scene consistency attention mechanism that enhances the transformer-based backbone Twins-SVT-L for feature extraction in crowded scenes. Additionally, we design a multi-level weakly-supervised loss function that generates estimated crowd counts in a coarse-to-fine manner, making it more appropriate for the weakly-supervised setting. Moreover, intermediate features supervised by count-level labels are utilized to fuse multi-scale features. Experimental results on four public datasets demonstrate that our proposed method outperforms state-of-the-art weakly-supervised methods, achieving up to a 16.6% improvement in MAE and up to a 13.8% improvement in RMSE across all evaluation settings. Furthermore, the proposed CCTwins achieves competitive counting performance even when compared with state-of-the-art fully-supervised methods.
