Abstract

Motion information can be important for detecting objects, but it has been used less for pedestrian detection, particularly in deep-learning-based methods. Following the success of two-stream convolutional networks, we propose a method that uses deep motion features in addition to deep still-image features, with the spatial and temporal streams trained separately. To extract motion cues for detection that are differentiated from other background motions, the temporal stream takes as input the difference between frames that are weakly stabilized by optical flow. To make the networks applicable to bounding-box-level detection, the mid-level features of the two streams are concatenated and combined with a sliding-window detector. We also introduce transfer learning from multiple sources into the two-stream networks, transferring still-image features from ImageNet and motion features from an action recognition dataset, to overcome the shortage of training data for convolutional neural networks in pedestrian datasets. We evaluated the method on two popular large-scale pedestrian benchmarks, the Caltech Pedestrian Detection Benchmark and the Daimler Mono Pedestrian Detection Benchmark, and observed a 10% improvement over the same method without motion features.

Highlights

  • Pedestrian detection is a long-standing challenge in the image recognition field, and its applications are diverse, e.g., in surveillance, traffic security, automatic driving, robotics, and human-computer interaction

  • Based on findings with hand-crafted motion features, we demonstrate that deep learning over SDt [7] efficiently models the contours of moving objects at a fine scale without unwanted motion edges

  • SDt, an effective motion feature for pedestrian detection, is used as the input to the temporal ConvNets instead of raw optical flow, as it factors out the camera- and object-centric motions that are prominent in videos from car-mounted cameras
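As a toy illustration of the SDt idea (all names here are hypothetical, and an exhaustive integer-translation search stands in for the coarse optical-flow stabilization used in the actual method): the previous frame is first weakly aligned to the current one and only then differenced, so camera and coarse object motion largely cancel while fine-scale articulated motion survives.

```python
def sad(a, b):
    """Sum of absolute differences between two equal-size 2D frames."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def shift(frame, dy, dx, fill=0):
    """Translate a frame by (dy, dx), filling uncovered pixels."""
    h, w = len(frame), len(frame[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = frame[sy][sx]
    return out

def weak_stabilize(prev, cur, radius=2):
    """Crude stand-in for coarse flow-based stabilization:
    pick the small global translation that best aligns prev to cur."""
    best = min(((dy, dx) for dy in range(-radius, radius + 1)
                         for dx in range(-radius, radius + 1)),
               key=lambda d: sad(shift(prev, d[0], d[1]), cur))
    return shift(prev, best[0], best[1])

def sdt(prev, cur):
    """Stabilized temporal difference: align first, then difference."""
    stab = weak_stabilize(prev, cur)
    return [[abs(c - p) for c, p in zip(rc, rp)]
            for rc, rp in zip(cur, stab)]
```

With a rigidly translated object, the stabilized difference vanishes where a raw frame difference would produce strong motion edges; in the real pipeline the alignment is a coarse optical-flow warp rather than a global translation, which is what lets SDt suppress background motion while keeping limb motion.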


Summary

Introduction

Pedestrian detection is a long-standing challenge in the image recognition field, and its applications are diverse, e.g., in surveillance, traffic security, automatic driving, robotics, and human-computer interaction. We present a deep learning method for pedestrian detection that exploits both spatial and temporal information with two-stream ConvNets. SDt, an effective motion feature for pedestrian detection, is used as the input to the temporal ConvNets instead of raw optical flow, as it factors out the camera- and object-centric motions that are prominent in videos from car-mounted cameras. After training, the accuracy on the validation set is 56.5%, which outperforms AlexNet on the raw RGB values of each frame (43.3%) but underperforms the temporal stream on optical flow reported by Simonyan and Zisserman [15]. This is not a problem, because our purpose was not to achieve the best performance on the activity dataset but to acquire effective features for temporal difference images.
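The bounding-box-level detection step described in the abstract can be sketched as follows (hypothetical names; a fixed linear scorer stands in for the trained detector): mid-level feature maps from the spatial and temporal streams are cropped per window, flattened, concatenated, and scored by a sliding-window classifier.

```python
def crop(fmap, y, x, h, w):
    """Extract an h-by-w window from a 2D feature map at (y, x)."""
    return [row[x:x + w] for row in fmap[y:y + h]]

def window_feature(spatial_map, temporal_map, y, x, h, w):
    """Concatenate flattened mid-level features from both streams."""
    s = [v for row in crop(spatial_map, y, x, h, w) for v in row]
    t = [v for row in crop(temporal_map, y, x, h, w) for v in row]
    return s + t

def sliding_window_scores(spatial_map, temporal_map, weights, h, w):
    """Score every window position with a linear detector."""
    H, W = len(spatial_map), len(spatial_map[0])
    scores = {}
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            f = window_feature(spatial_map, temporal_map, y, x, h, w)
            scores[(y, x)] = sum(wi * fi for wi, fi in zip(weights, f))
    return scores
```

In the actual method the per-window scorer is learned and the feature maps come from the two ConvNet streams; the point of the sketch is only that concatenation happens at the mid-level feature stage, before window scoring.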

Pre-processing for the temporal stream
Method
Findings
Conclusions
