Abstract

We present a video representation learning framework for real-time video object detection. Our approach is based on the interesting observation that a powerful prior knowledge of video context helps to improve object recognition, and it can be acquired via learning video representations through stochastic video prediction. Our proposed framework utilizes the stochastic video prediction into object detection so that we first acquire a prior knowledge of videos to have video representations and then adjust them to object detection to improve the accuracy. We validate our proposed method on ImageNet VID and VisDrone-VID2019 datasets to demonstrate the effectiveness of video representation learning via future video prediction. In particular, our extensive experiments on ImageNet VID show that our approach achieves 73.1% mAP at 54 fps with ResNet-50 on commercial GPUs

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.