Abstract

In this article, we propose a new cross locality relation network (CLRNet) to generate high-quality crowd density maps for crowd counting in videos. Specifically, a cross locality relation module (CLRM) is proposed to enhance feature representations by modeling local dependencies of pixels between adjacent frames with an adapted local self-attention mechanism. First, unlike existing methods, which measure the similarity between pixels by dot product, a new adaptive cosine similarity is introduced to measure the relationship between two positions. Second, traditional self-attention modules usually integrate the reconstructed features with the same weights for all positions. However, crowd movement and background changes in a video sequence are uneven in real-life applications, so it is inappropriate to treat all positions in the reconstructed features equally. To address this issue, a scene consistency attention map (SCAM) is developed to make CLRM pay more attention to positions with strong correlations between adjacent frames. Furthermore, CLRM is incorporated into the network in a coarse-to-fine manner to further enhance the representational capability of the features. Experimental results on four public video datasets demonstrate the effectiveness of the proposed CLRNet in comparison with state-of-the-art methods. The code is available at: https://github.com/Amelie01/CLRNet.
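The abstract does not give the exact form of the adaptive cosine similarity, so the following is only a minimal NumPy sketch of the underlying idea: replacing the dot-product scores of standard self-attention with cosine-similarity scores, which normalize away vector magnitude so that only the directional relationship between two positions' features matters. All function names and shapes here are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax, shifted for numerical stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def dot_product_attention(q, k, v):
    """Standard self-attention: scaled dot-product similarity scores."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def cosine_similarity_attention(q, k, v, eps=1e-8):
    """Attention with cosine-similarity scores (illustrative sketch).

    Each query/key vector is normalized to unit length first, so every
    score lies in [-1, 1] and depends only on feature direction, not
    magnitude. The paper's *adaptive* variant is not specified in the
    abstract; this shows only the plain cosine-similarity baseline.
    """
    qn = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    kn = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    scores = qn @ kn.T
    return softmax(scores) @ v
```

In a video setting, `q` would come from the current frame and `k`, `v` from an adjacent frame, so the attention weights express cross-frame dependencies between positions; the SCAM described above would then reweight the reconstructed features position by position instead of applying them uniformly.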
