Abstract

Convolutional neural networks (CNNs) have achieved great success in many vision tasks. A key to this success is their ability to automatically learn powerful high-level and low-level features. In general, low-level features have small receptive fields and appear multiple times at different locations of an object, while high-level semantic features have relatively large receptive fields and appear only once at a specific location of an object. However, traditional CNNs treat these two kinds of features in the same manner, i.e., learning both through the convolution operation, which can be approximately viewed as accumulating the probabilities that a feature appears at different locations. This strategy is reasonable for low-level features but not for high-level semantic ones, especially in pedestrian detection, where a local feature can be shared across locations but a semantic part, e.g., a head, appears only once per person. To jointly model the spatial structure and appearance of high-level semantic features, we propose a new module that learns spatially weighted max pooling in CNNs. The proposed method is evaluated on several pedestrian detection databases, and the experimental results show that it achieves much better performance than traditional CNNs.
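The abstract does not give implementation details, but the core idea of spatially weighted max pooling can be sketched as follows: each spatial location of a high-level feature map is scaled by a learned weight before a standard max pooling step, so the pooled response depends on where a feature fires, not only on whether it fires. This is a minimal PyTorch sketch under stated assumptions; the module name, the per-location weight map shared across channels, the initialization to ones, and the input sizes are illustrative choices, not the authors' exact design.

```python
import torch
import torch.nn as nn

class SpatiallyWeightedMaxPool(nn.Module):
    """Sketch of spatially weighted max pooling (hypothetical module name).

    A learnable weight is attached to every spatial location of the input
    feature map. The input is scaled elementwise by this weight map and then
    passed through ordinary max pooling, so the layer can favor features that
    appear at the expected location of a semantic part (e.g., a head).
    """

    def __init__(self, height, width, kernel_size=2):
        super().__init__()
        # One learnable weight per spatial location, shared across channels
        # (a per-channel weight map would be an equally plausible variant).
        self.weight = nn.Parameter(torch.ones(1, 1, height, width))
        self.pool = nn.MaxPool2d(kernel_size)

    def forward(self, x):
        # x: (batch, channels, height, width); broadcasting applies the
        # (1, 1, height, width) weight map to every example and channel.
        return self.pool(x * self.weight)

# Usage on top-level features of an assumed 16x8 spatial size:
layer = SpatiallyWeightedMaxPool(height=16, width=8)
features = torch.randn(4, 256, 16, 8)
pooled = layer(features)  # shape: (4, 256, 8, 4)
```

Because the weight map is trained jointly with the rest of the network, gradients encourage large weights where a semantic part consistently appears and small weights elsewhere, which is one way to realize the joint modeling of spatial structure and appearance described above.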
