Abstract

One of the major challenges in video object detection is the drastic change in object scale caused by camera motion. In this paper, we propose a two-path Convolutional Long Short-Term Memory (convLSTM) pyramid network designed to extract and convey multi-scale temporal contextual information in order to handle object scale changes efficiently. The proposed two-path convLSTM pyramid consists of a stack of multi-input convLSTM modules. It is updated along top-down and bottom-up pathways so that temporal contextual information for both small-to-large and large-to-small scale changes is exploited. Each multi-input convLSTM module takes two input feature maps of different resolutions, which lets it store temporal contextual information of different scales and exchange it with neighboring convLSTM modules. The outputs of the proposed convLSTM pyramid network constitute a feature pyramid in which each feature map contains multi-scale temporal contextual information from earlier frames. The proposed convLSTM pyramid can be combined with various still-image object detectors to improve the performance of video object detection. Experimental results on the ImageNet VID dataset show that the proposed method achieves state-of-the-art performance and handles scale changes efficiently in video object detection.

Highlights

  • Since the introduction of convolutional neural networks (CNNs) for image classification [1], significant improvement has been achieved in still-image object detection

  • We propose a two-path convLSTM pyramid which consists of a stack of multi-input convLSTM modules to extract and pass multi-scale temporal contextual information in videos

  • After the bottom-up update, the four outputs of the multi-input convLSTMs form a feature pyramid that contains both the multi-resolution features from the current frame and the multi-scale temporal contextual information from the previous frames


Summary

Introduction

Since the introduction of convolutional neural networks (CNNs) for image classification [1], significant improvement has been achieved in still-image object detection. The outputs of the proposed two-path convLSTM pyramid network constitute a feature pyramid in which each feature map contains multi-scale temporal contextual information from earlier frames. To the best of our knowledge, our work is the first approach that introduces connections between convLSTMs at different levels of the pyramid and exploits temporal contextual information for small-to-large and large-to-small scale changes in video object detection.
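
To make the idea of connected convLSTMs at different pyramid levels concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes one plausible reading of the description: each cell receives its own level's feature map plus the hidden state of a neighboring level (resized to match), and the pyramid is updated once top-down and once bottom-up per frame. The names `MultiInputConvLSTM`, `update_pyramid`, and `resize_nearest`, the shared weights across both passes, the zero input at the boundary levels, and the channel and resolution choices are all hypothetical.

```python
import numpy as np

def conv3x3(x, w):
    """Same-padded 3x3 convolution. x: (C_in, H, W); w: (C_out, C_in, 3, 3)."""
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for dy in range(3):
        for dx in range(3):
            # Contract input channels for this kernel offset.
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx],
                             xp[:, dy:dy + h, dx:dx + wd])
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def resize_nearest(x, h, w):
    """Nearest-neighbour resize of a (C, H, W) map to (C, h, w)."""
    rows = np.arange(h) * x.shape[1] // h
    cols = np.arange(w) * x.shape[2] // w
    return x[:, rows][:, :, cols]

class MultiInputConvLSTM:
    """Hypothetical cell: the gates see the level's own feature map, a resized
    map from a neighbouring level, and the previous hidden state."""

    def __init__(self, channels, rng):
        # One weight tensor produces all four gates (input, forget, output, candidate).
        self.w = rng.normal(0.0, 0.1, (4 * channels, 3 * channels, 3, 3))

    def step(self, x, neighbour, h, c):
        neighbour = resize_nearest(neighbour, x.shape[1], x.shape[2])
        z = conv3x3(np.concatenate([x, neighbour, h]), self.w)
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c

def update_pyramid(cells, feats, hs, cs):
    """Process one frame: top-down pass, then bottom-up pass.
    Index 0 is the finest level. hs and cs are updated in place."""
    n = len(cells)
    for order, offset in ((range(n - 1, -1, -1), 1), (range(n), -1)):
        for level in order:
            nb = level + offset  # coarser level going down, finer level going up
            # Assumption: boundary levels without a neighbour receive zeros.
            neighbour = hs[nb] if 0 <= nb < n else np.zeros_like(feats[level])
            hs[level], cs[level] = cells[level].step(
                feats[level], neighbour, hs[level], cs[level])
    return hs

# Usage: four pyramid levels, three frames of random stand-in backbone features.
rng = np.random.default_rng(0)
channels, sizes = 4, [16, 8, 4, 2]
cells = [MultiInputConvLSTM(channels, rng) for _ in sizes]
hs = [np.zeros((channels, s, s)) for s in sizes]
cs = [np.zeros((channels, s, s)) for s in sizes]
for _ in range(3):
    feats = [rng.normal(size=(channels, s, s)) for s in sizes]
    pyramid = update_pyramid(cells, feats, hs, cs)
```

In this sketch the four hidden states returned after the bottom-up pass play the role of the feature pyramid: each keeps its own resolution while its state has been influenced by both coarser and finer levels and by earlier frames. A still-image detector head would consume `pyramid` in place of the per-frame backbone features.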

