Abstract

Crowd counting is a challenging task due to large perspective, density, and scale variations. CNN-based crowd counting techniques have achieved significant performance in sparse to dense environments. However, crowd counting in high perspective-varying scenes (images) is getting harder due to different density levels occupied by the same number of pixels. In this way large variations for objects in the same spatial area make it difficult to count accurately. Further, existing CNN-based crowd counting methods are used to extract rich deep features; however, these features are used locally and disseminated while propagating through intermediate layers. This results in high counting errors, especially in dense and high perspective-variation scenes. Further, class-specific responses along channel dimensions are underestimated. To address these above mentioned issues, we therefore propose a CNN-based dense feature extraction network for accurate crowd counting. Our proposed model comprises three main modules: (1) backbone network, (2) dense feature extraction modules (DFEMs), and (3) channel attention module (CAM). The backbone network is used to obtain general features with strong transfer learning ability. The DFEM is composed of multiple sub-modules called dense stacked convolution modules (DSCMs), densely connected with each other. In this way features extracted from lower and middle-lower layers are propagated to higher layers through dense connections. In addition, combinations of task independent general features obtained by the former modules and task-specific features obtained by later ones are incorporated to obtain high counting accuracy in large perspective-varying scenes. Further, to exploit the class-specific response between background and foreground, CAM is incorporated at the end to obtain high-level features along channel dimensions for better counting accuracy. Moreover, we have evaluated the proposed method on three well known datasets: Shanghaitech (Part-A), Shanghaitech (Part-B), and Venice. The performance of the proposed technique justifies its relative effectiveness in terms of selected performance compared to state-of-the-art techniques.

Highlights

  • A crowd can be defined as a complex phenomenon due to constant interaction among people within the crowd, distinct behaviors of individuals, and inter-object occlusions

  • An effective combination of backbone network and dense feature extraction modules (DFEMs) plays a vital role in increasing the counting accuracy by aggregating detailed and semantic features

  • The proposed approach is composed of three modules: backbone network, DFEM, and channel attention module (CAM)

Read more

Summary

Introduction

A crowd can be defined as a complex phenomenon due to constant interaction among people within the crowd, distinct behaviors of individuals, and inter-object occlusions. Existing CNN-based crowd counting (CC) techniques enhance the counting accuracy by using well-known networks such as multi-column, multi-tasking, dilated, and de-convolutional [3] networks These networks have been widely used individually or in combination with each other to increase the performance at the cost of major shortcomings, such as large amounts of training time, ineffective branch structure, sparse pixel sampling rates, information loss, and extraction of irrelevant information. The authors of [6] proved through detailed experiment that multiple parallel columns [4,5] extract similar types of features irrespective of different kernel sizes Based on these observations, we propose an attention-based dense feature extraction network for single-image crowd counting (HADF-Crowd) method. The proposed approach is responsible for aggregating task-independent and taskspecific features at higher layers from lower and middle-lower layers, enhancing the estimation accuracy

Related Work
The Proposed Approach
Backbone Network
Task-Independent and Task-Specific Feature Acquisition
Implementation Details
Network Configuration
Training Details
Data Preparation
Performance Evaluation
Metrics Used for Evaluation
Venice Dataset
Ablation Study
Conclusions and Future Work
Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call