Abstract

Deep learning has recently led to significant improvements in object detection accuracy. In many applications, object detection is performed on video data consisting of a sequence of two-dimensional (2D) image frames. Numerous object detection schemes have been designed to detect objects independently in each video frame. Although temporal information within adjacent image frames can be exploited in a subsequent object tracking stage, it has been shown that detection accuracy can be significantly improved by exploiting the temporal structure of the image sequence in the object detection stage itself. In this paper, we propose a novel video object detection method that exploits both the motion context inferred from adjacent frames and the spatio-temporal features aggregated over the image sequence. First, the correlation between the spatial feature maps of each pair of adjacent frames is computed, and an embedding vector representing the motion context is obtained by encoding the resulting N correlation maps using long short-term memory (LSTM). In addition to utilizing the motion context, the spatial feature maps of N+1 consecutive frames are aggregated to boost the quality of the feature map. A gated attention network is employed to selectively combine the temporal feature maps based on their relevance to the feature map of the present image frame. While most video object detectors have been developed for two-stage object detectors, our proposed idea applies to one-stage detectors, which offer the advantage of low computational complexity in practical real-time applications. Our numerical evaluation conducted on the ImageNet object detection from video (VID) dataset demonstrates that the proposed network achieves a significant performance gain over the baseline algorithms and outperforms existing one-stage video object detectors.
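
The two operations named in the abstract, correlation between adjacent feature maps and relevance-weighted aggregation over N+1 frames, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the displacement window `max_disp`, and the use of cosine-similarity softmax weights in place of the learned gated attention network are all assumptions made for illustration, and the LSTM encoding of the correlation maps is omitted.

```python
import numpy as np

def correlation_maps(feats, max_disp=1):
    """Local correlation between each pair of adjacent frames.

    feats: array of shape (T, C, H, W) holding T = N+1 spatial feature maps.
    Returns an array of shape (T-1, (2*max_disp+1)**2, H, W): one channel per
    candidate displacement. (Hypothetical helper; the window size is assumed.)
    """
    T, C, H, W = feats.shape
    d = max_disp
    out = np.zeros((T - 1, (2 * d + 1) ** 2, H, W), dtype=feats.dtype)
    for t in range(T - 1):
        # Zero-pad the next frame so every shifted window has shape (C, H, W).
        nxt = np.pad(feats[t + 1], ((0, 0), (d, d), (d, d)))
        k = 0
        for dy in range(2 * d + 1):
            for dx in range(2 * d + 1):
                shifted = nxt[:, dy:dy + H, dx:dx + W]
                # Channel-averaged inner product at every spatial position.
                out[t, k] = (feats[t] * shifted).mean(axis=0)
                k += 1
    return out

def attention_aggregate(feats, eps=1e-8):
    """Combine T feature maps by their relevance to the present (last) frame.

    Assumption: relevance is the per-position cosine similarity to the last
    frame, normalized with a softmax over time. The paper uses a learned
    gated attention network instead.
    """
    ref = feats[-1]                                   # (C, H, W)
    num = (feats * ref).sum(axis=1)                   # (T, H, W)
    den = np.linalg.norm(feats, axis=1) * np.linalg.norm(ref, axis=0) + eps
    sim = num / den
    w = np.exp(sim - sim.max(axis=0, keepdims=True))  # numerically stable softmax
    w /= w.sum(axis=0, keepdims=True)
    aggregated = (w[:, None] * feats).sum(axis=0)     # (C, H, W)
    return aggregated, w
```

For N = 3, `feats` would have shape `(4, C, H, W)`: `correlation_maps` then yields three correlation maps, which the paper encodes with an LSTM, and `attention_aggregate` returns one feature map with the same shape as a single frame.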
