Abstract

Most existing video-based object detection methods use a successful image-based object detector as the base network and additionally exploit temporal information through either bounding-box post-processing or feature enhancement across multiple frames. However, little work has directly modeled temporal motion in an efficient way for detection in surveillance videos. In this paper, a simple but effective module, denoted motion-from-memory (MFM), is proposed to encode temporal context for improved detection in surveillance videos. Given appearance features extracted by a base CNN, the MFM module maintains a dynamic memory for each input sequence and outputs motion features for each frame. The module adds only minor model parameters and computation, yet is very helpful for moving-object detection, especially in surveillance videos. With the MFM module, the mAP of a lightweight MobileNet-based Faster R-CNN detector is boosted by 13.93%, reaching performance comparable to that of a strong ResNet-50-based detector. When MFM is integrated into an even weaker but faster single-stage detector and submitted to the UA-DETRAC vehicle detection benchmark, it ranks second among all published works, with 69.10% mAP versus 69.87% for the best. When running speed is considered, however, the proposed method is the fastest, processing 540×960 surveillance videos at 33 FPS on a moderate commercial GPU (NVIDIA GTX 1080Ti), about 3 times faster than the second fastest method.
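The abstract only sketches the mechanism (a dynamic memory per sequence that yields per-frame motion features). As an illustration, not the paper's actual design, one simple way such a module could work is to keep an exponential moving average of the appearance features as the memory and emit the residual between the current frame and the memory as a "motion" feature. All names and the EMA update rule below are assumptions for illustration only.

```python
import numpy as np

class MotionFromMemory:
    """Illustrative sketch of a motion-from-memory (MFM) style module.

    Maintains a dynamic memory (here, an exponential moving average of
    per-frame appearance features) and outputs the residual between the
    current frame's features and that memory as a 'motion' feature.
    Static regions produce near-zero responses; changes stand out.
    This is a hypothetical sketch, not the authors' implementation.
    """

    def __init__(self, momentum: float = 0.9):
        self.momentum = momentum
        self.memory = None  # lazily initialized per input sequence

    def reset(self):
        """Call at the start of each new video sequence."""
        self.memory = None

    def __call__(self, features: np.ndarray) -> np.ndarray:
        # features: appearance features from the base CNN, shape (C, H, W)
        if self.memory is None:
            self.memory = features.copy()
        motion = features - self.memory  # static content -> ~0
        self.memory = (self.momentum * self.memory
                       + (1.0 - self.momentum) * features)
        return motion


# A static frame yields zero motion; a uniform change of +1 shows up fully.
mfm = MotionFromMemory(momentum=0.9)
frame = np.ones((8, 4, 4), dtype=np.float32)
first_motion = mfm(frame)         # first frame initializes the memory
second_motion = mfm(frame + 1.0)  # uniform appearance change of +1
print(float(np.abs(first_motion).max()))  # 0.0
print(float(second_motion.mean()))        # 1.0
```

Because the memory is a running aggregate rather than a stack of past frames, the added parameter and compute cost stays small, which is consistent with the efficiency claim in the abstract.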
