Over the past five years, weakly supervised crowd counting methods have attracted growing interest because they rely only on count-level annotations and avoid a laborious labeling process. However, existing weakly supervised methods usually fail to match the counting performance of fully supervised methods. To improve the accuracy of crowd counting, we propose combining the convolutional neural network (CNN) and Transformer frameworks. Because CNNs focus on capturing local detail and Transformers effectively extract global context, we expect their combination to learn more effective feature representations for crowd images. Our proposed framework, CrowdCCT (Crowd Counting via CNN and Transformer), is composed of a CNN feature extraction part, a Transformer feature extraction part, and a counting regression part. In the CNN part, we use DenseNet121, whose inherent dense connection structure learns rich semantic features. In the Transformer part, we introduce two attention modules, Multi-Scale Dilated Attention (MSDA) and Location-Enhanced Attention (LEA), which work together to extract more expressive features. The output features are then fed into the regression part to generate the predicted count. Experiments on four crowd counting benchmark datasets demonstrate that CrowdCCT achieves superior performance. The experimental results also validate the feasibility and effectiveness of combining CNN and Transformer for weakly supervised counting tasks. We hope this work will encourage further research on combining CNN and Transformer.
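As a rough illustration of what count-level supervision means for the regression part, the following minimal sketch pools a set of feature tokens into one vector, maps it to a single scalar count, and compares that count with the annotated total. The average pooling step, the linear head, the L1 loss, and all function names here are assumptions made for illustration; the abstract does not specify how CrowdCCT implements any of them.

```python
from typing import List, Sequence


def global_average_pool(tokens: Sequence[Sequence[float]]) -> List[float]:
    """Average a list of feature tokens (equal-length vectors) into one vector."""
    n = len(tokens)
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / n for d in range(dim)]


def regress_count(pooled: Sequence[float],
                  weights: Sequence[float],
                  bias: float) -> float:
    """Hypothetical linear regression head: one scalar count per image."""
    return sum(w * x for w, x in zip(weights, pooled)) + bias


def count_l1_loss(predicted: Sequence[float],
                  annotated: Sequence[float]) -> float:
    """Mean absolute error between predicted and annotated image-level counts.

    Only the total count per image is needed as a label, which is what
    makes the supervision "weak": no point or density annotations are used.
    """
    return sum(abs(p - a) for p, a in zip(predicted, annotated)) / len(predicted)


# Toy example: two 2-dimensional feature tokens for a single image.
tokens = [[1.0, 2.0], [3.0, 4.0]]
pooled = global_average_pool(tokens)
count = regress_count(pooled, weights=[10.0, 5.0], bias=1.0)
loss = count_l1_loss([count], [40.0])
```

In a real model the tokens would come from the CNN and Transformer parts and the head's parameters would be learned by minimizing such a loss; the toy values above only show the data flow from features to a single count.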