Abstract

In this paper, we address the challenging problem of detecting pedestrians that are heavily occluded and/or far from the camera. Unlike most existing pedestrian detection methods, which use only coarse-resolution feature maps with fixed receptive fields, our approach exploits multi-grained deep features to make the detector robust to the visible parts of occluded pedestrians and to small-size targets. Specifically, we jointly train a multi-scale network and a human parsing network in a weakly supervised manner with only bounding box annotations. We carefully design the multi-scale network to predict pedestrians of particular scales with the most appropriate feature maps, by matching their receptive fields with the target sizes. The human parsing network generates a fine-grained attention map, which guides the detector to focus on the visible parts of occluded pedestrians and on small-size instances. Both networks are computed in parallel and form a unified single-stage pedestrian detector, which assures a suitable tradeoff between accuracy and speed. Moreover, we introduce an adversarial hiding network to make our detector more robust to occlusion: it generates occlusions on pedestrians with the goal of fooling the detector, which in turn adapts to localize these adversarial instances. Experiments on three challenging pedestrian detection benchmarks show that our proposed method achieves state-of-the-art performance and runs $2\times$ faster than competing methods.
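The attention mechanism described above can be illustrated with a minimal sketch: a fine-grained attention map (e.g., a sigmoid output of the parsing branch) modulates a convolutional feature map by pixel-wise multiplication, broadcast over channels. All names and shapes here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def apply_attention(features: np.ndarray, attention: np.ndarray) -> np.ndarray:
    """Modulate a (C, H, W) feature map with an (H, W) attention map.

    The attention map is assumed to lie in [0, 1] (e.g., a sigmoid of the
    parsing branch's logits), so suppressed regions are damped while
    visible-pedestrian regions pass through largely unchanged.
    """
    assert features.shape[1:] == attention.shape
    # Broadcast the single-channel attention map across all feature channels.
    return features * attention[None, :, :]

# Toy example with hypothetical shapes.
C, H, W = 4, 8, 8
features = np.random.rand(C, H, W)
attention = 1.0 / (1.0 + np.exp(-np.random.randn(H, W)))  # sigmoid-like map
out = apply_attention(features, attention)
```

In practice such a modulated map would feed the detection head in place of the raw features, leaving the rest of the single-stage pipeline unchanged.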
