Abstract

Crowd density estimation is a task in intelligent applications for which runtime efficiency is very important. However, to improve estimation performance, most existing works design larger and more complex network structures. These networks consume considerable memory, time and other resources at runtime and depend on high-performance hardware platforms, which makes them difficult to apply in practice. To overcome these problems, we propose a lightweight dense crowd estimation method based on channel attention multi-scale feature fusion. For feature extraction, we design an efficient and lightweight convolution module (L-weight) that extracts crowd features in stages, reducing the number of network parameters and the computing cost. A pyramid-structured feature extraction network captures multi-scale crowd information, which addresses the uneven crowd scale found in video images. For feature fusion, we design a channel attention fusion module that weights and fuses feature information from different scales, so that multi-scale information is fused effectively and useless information is suppressed. In addition, we design a new loss function that combines a pixel space loss (L2), a counting loss (LC) and a structural similarity loss (LS) to increase the network's sensitivity to crowds and ensure counting accuracy. Extensive experiments on four mainstream datasets demonstrate that, compared with other state-of-the-art methods, our method achieves an optimal trade-off between counting performance and running speed, and is suitable for low-performance computing platforms such as embedded devices.
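The abstract names the three loss terms but gives neither their exact form nor how they are weighted. The following is a minimal sketch of how such a composite loss could be assembled, not the authors' implementation. It assumes a mean-squared-error pixel term, an absolute count difference for LC, and a single-window (global) SSIM for LS. The function names, the weights `w_count` and `w_ssim`, and the constants `c1` and `c2` are all hypothetical.

```python
from statistics import fmean


def flatten(density_map):
    """Turn a 2-D density map (list of rows) into a flat list of values."""
    return [v for row in density_map for v in row]


def pixel_loss(pred, target):
    """L2 term (assumed form): mean squared error over all pixels."""
    p, t = flatten(pred), flatten(target)
    return fmean([(a - b) ** 2 for a, b in zip(p, t)])


def count_loss(pred, target):
    """LC term (assumed form): absolute difference of the integrated counts.

    The crowd count of a density map is the sum of its pixel values.
    """
    return abs(sum(flatten(pred)) - sum(flatten(target)))


def ssim_loss(pred, target, c1=1e-4, c2=9e-4):
    """LS term (assumed form): 1 - SSIM, with the whole map as one window.

    Practical SSIM uses local sliding windows; the global version here is
    a simplification that keeps the sketch short.
    """
    p, t = flatten(pred), flatten(target)
    mu_p, mu_t = fmean(p), fmean(t)
    var_p = fmean([(a - mu_p) ** 2 for a in p])
    var_t = fmean([(b - mu_t) ** 2 for b in t])
    cov = fmean([(a - mu_p) * (b - mu_t) for a, b in zip(p, t)])
    ssim = ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    )
    return 1.0 - ssim


def total_loss(pred, target, w_count=1.0, w_ssim=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (
        pixel_loss(pred, target)
        + w_count * count_loss(pred, target)
        + w_ssim * ssim_loss(pred, target)
    )


if __name__ == "__main__":
    gt = [[0.0, 0.5], [0.5, 1.0]]    # ground-truth density map, count 2.0
    pred = [[0.0, 0.5], [0.5, 2.0]]  # prediction overcounts by 1.0
    print(pixel_loss(pred, gt), count_loss(pred, gt), total_loss(pred, gt))
```

Under these assumptions the pixel term penalises local density errors, the counting term penalises errors in the integrated count, and the structural term penalises differences in overall map structure. This matches the three roles the abstract assigns to L2, LC and LS.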
