Abstract

Common video-based object detectors exploit temporal contextual information to improve object detection performance. However, detection under challenging conditions has not yet been thoroughly studied. In this paper, we focus on improving detection performance for challenging events such as aspect ratio change, occlusion, or large motion. To this end, we propose a video object detection network using event-aware ConvLSTM and object relation networks. The proposed event-aware ConvLSTM highlights the areas where such challenging events take place, making it easier than with a traditional ConvLSTM to exploit temporal contextual information to support video-based object detectors under challenging events. To further improve detection performance, an object relation module using supporting-frame selection enhances the pooled features for the target ROI; it effectively selects the features of the same object from one of the reference frames rather than from all of them. Experimental results on the ImageNet VID dataset show that the proposed method achieves an mAP of 81.0% without any post-processing and handles challenging events efficiently in video object detection.

Highlights

  • The introduction and popularization of convolutional neural networks (CNN) [1,2] have greatly improved the performance of still-image object detectors [3,4,5,6]

  • Experiments on the ImageNet VID dataset show that the proposed method achieves a mean Average Precision (mAP) of 81.0% without any post-processing and can handle challenging events efficiently in video object detection

  • We evaluate the performance using the mean Average Precision at an Intersection over Union (IoU) threshold of 0.5
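The mAP-at-0.5 criterion above relies on the standard box-overlap measure: a predicted box counts as a match only when its IoU with a ground-truth box is at least 0.5. A minimal sketch (the function name and box format `(x1, y1, x2, y2)` are our own illustration, not code from the paper):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (may be empty, hence the max(0, ...) clamps).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Under the 0.5 threshold, this pair would not count as a correct detection:
# iou((0, 0, 10, 10), (5, 5, 15, 15)) is 25 / 175, well below 0.5.
```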

Introduction

The introduction and popularization of convolutional neural networks (CNN) [1,2] have greatly improved the performance of still-image object detectors [3,4,5,6]. Earlier video object detectors such as [7] borrowed ideas from still-image object detectors and dealt with sequential information through post-processing methods. One straightforward solution is to apply a recurrent neural network such as an RNN [8], LSTM [9], or ConvLSTM [10]; these are typical networks for handling sequential data and the temporal information it carries. Many recent video object detectors [11,12,13] apply recurrent neural networks to generate temporal contextual information, which benefits video object detection in numerous aspects.
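The ConvLSTM mentioned above replaces the matrix multiplications of a standard LSTM with convolutions, so the recurrent state keeps its spatial layout and can carry per-location temporal context across frames. The following is a minimal NumPy sketch of a plain ConvLSTM cell in the style of Shi et al. [10], not the event-aware variant proposed in this paper; all names and the slow explicit convolution loop are our own illustration:

```python
import numpy as np

def conv2d_same(x, w):
    """Stride-1 'same' 2-D convolution: x is (C_in, H, W), w is (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, wd = x.shape[1], x.shape[2]
    out = np.zeros((c_out, h, wd))
    for i in range(h):
        for j in range(wd):
            patch = xp[:, i:i + k, j:j + k]
            out[:, i, j] = np.tensordot(w, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ConvLSTMCell:
    """Plain ConvLSTM cell: the four gates are convolutions over the
    concatenation of the current input and the previous hidden state."""
    def __init__(self, in_ch, hid_ch, k=3, seed=0):
        rng = np.random.default_rng(seed)
        # One weight tensor produces all four gates (input, forget, output, candidate).
        self.w = rng.normal(0.0, 0.1, (4 * hid_ch, in_ch + hid_ch, k, k))
        self.b = np.zeros((4 * hid_ch, 1, 1))

    def step(self, x, h, c):
        z = conv2d_same(np.concatenate([x, h], axis=0), self.w) + self.b
        i, f, o, g = np.split(z, 4, axis=0)
        c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # gated cell update
        h_new = sigmoid(o) * np.tanh(c_new)               # spatial hidden state
        return h_new, c_new
```

Because `h` and `c` stay three-dimensional feature maps, the cell can in principle weight each spatial location differently over time, which is the property the event-aware variant builds on to highlight regions where challenging events occur.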
