Abstract

High resolution and strong semantic representation are both vital for feature extraction networks in pedestrian detection. The existing high-resolution network (HRNet) has shown promising performance for pedestrian detection. However, we observe that it still has significant shortcomings for heavily occluded and small-scale pedestrians. In this paper, we address these shortcomings by extracting semantic and spatial context from HRNet. Specifically, we propose a Context-aware Feature Representation Learning Module (CFRL-Module), which combines a two-path Multi-scale Feature Context Extraction Parallel Block for Convolution and Self-attention (CEPCA-Block) with an Equivalent FFN (EFFN) Block. The core CEPCA-Block adopts a parallel design that integrates convolution and multi-head self-attention (MHSA) at low parameter and computational cost: the convolution path captures deep semantic context, and the MHSA path captures precise context. Furthermore, to overcome the inefficiency of global MHSA in high-resolution pedestrian detection, we propose a novel local window MHSA, which significantly reduces memory consumption while barely affecting detection performance. Cascading the proposed CFRL-Module with an anchor-free detection head constitutes our Context-aware Feature Representation Learning Anchor-Free Network (CFRLA-Net). Built on HRNet, CFRLA-Net captures a high-level understanding of heavily occluded and small-scale pedestrian instances, which effectively addresses HRNet's insufficient feature extraction ability for these hard samples. Experimental results show that CFRLA-Net achieves state-of-the-art performance on the CityPersons, Caltech, and CrowdHuman benchmarks.
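To illustrate why restricting MHSA to local windows can reduce memory on high-resolution feature maps, the following minimal sketch counts attention-score entries per head. It is not the paper's implementation: the abstract does not specify the window scheme, so non-overlapping square windows are assumed, and the function name `attention_entries`, the 256 x 512 feature map, and the window size of 8 are all hypothetical.

```python
def attention_entries(height, width, window=None):
    """Count attention-score entries per head for a height x width feature map.

    window=None models global self-attention over all tokens. Otherwise the
    map is assumed to be split into non-overlapping window x window tiles,
    with attention computed independently inside each tile.
    """
    tokens = height * width
    if window is None:
        # Global attention: every token attends to every other token.
        return tokens * tokens
    if height % window or width % window:
        raise ValueError("window must divide both height and width")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    # Each window holds its own (tokens_per_window x tokens_per_window) matrix.
    return num_windows * tokens_per_window ** 2


h, w = 256, 512  # hypothetical high-resolution feature map
global_cost = attention_entries(h, w)
local_cost = attention_entries(h, w, window=8)
print(global_cost // local_cost)  # 2048
```

Under these assumptions the saving factor equals the number of tokens divided by the number of tokens per window, so it grows with feature-map resolution, which is where global MHSA is most expensive.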
