Abstract

Video object detection still faces several challenges. For example, the imbalance between positive and negative samples lowers information-processing efficiency, and detection performance declines under abnormal conditions in video. This paper addresses these challenges with video object detection based on local attention. We propose a local attention sequence model and optimize the parameters and computation of ConvGRU, so that spatial and temporal information in videos is processed more efficiently and detection performance under abnormal conditions ultimately improves. Experiments on ImageNet VID show that our method improves detection accuracy by 5.3%, and visualization results show that the method adapts to different abnormal conditions, thereby improving the reliability of video object detection.
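The abstract refers to a ConvGRU whose parameters and computation are optimized; the paper's specific modifications are not reproduced here. As a point of reference only, the following is a minimal sketch of a standard ConvGRU cell in PyTorch, where the class name, channel counts, and kernel size are illustrative assumptions rather than the authors' configuration.

import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Standard ConvGRU cell: the gates are computed with 2-D convolutions,
    so the hidden state keeps its spatial layout (a feature map)."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        pad = k // 2
        # update (z) and reset (r) gates, computed from [x_t, h_{t-1}]
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=pad)
        # candidate hidden state, computed from [x_t, r * h_{t-1}]
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=pad)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        # convex combination of the previous state and the candidate
        return (1 - z) * h + z * h_tilde

In a video detector, such a cell would be applied to per-frame backbone feature maps, carrying the hidden state forward from frame to frame.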

Highlights

  • Object detection is a fundamental problem in computer vision and has been widely used in the fields of surveillance, robotics, medical intelligence, etc.

  • Instead of relying on optical flow, we propose an innovative video object detection model based on local attention

  • Comparative results show that, with the local attention sequence model, the video object detection model better handles the detection difficulties caused by occlusion during object motion, pose changes, and blur introduced by camera movement

Summary

Introduction

Object detection is a fundamental problem in computer vision and has been widely used in the fields of surveillance, robotics, medical intelligence, etc. Redmon proposed the YOLO [5] detection framework in 2015, which divides the image into a grid and simultaneously predicts the object category and bounding box for each grid cell. Applying such image-based object detectors to videos is often unsatisfactory because of the deteriorated appearance caused by motion blur, out-of-focus cameras, and rare poses frequently encountered in videos. Existing methods that leverage temporal information for video object detection usually use optical flow to propagate high-level features across frames. Our contribution is as follows: we introduce a novel video object detector based on local attention that establishes spatial and temporal correspondence across frames without extra optical flow models.
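To make the flow-free idea concrete, the sketch below aggregates reference-frame features into the current frame with dot-product attention restricted to a local spatial window, instead of warping features with optical flow. It is a minimal illustration, not the paper's exact formulation; the window size, the scaled dot-product similarity, and the function name are assumptions.

import torch
import torch.nn.functional as F

def local_attention_aggregate(query_feat, ref_feat, window=5):
    """Aggregate reference-frame features into the current frame using
    attention restricted to a (window x window) neighbourhood around
    each spatial location.
    query_feat, ref_feat: (B, C, H, W) feature maps from adjacent frames."""
    B, C, H, W = query_feat.shape
    pad = window // 2
    # Gather the local neighbourhood of every reference position:
    # (B, C * window * window, H*W) -> (B, C, window*window, H*W)
    neigh = F.unfold(ref_feat, kernel_size=window, padding=pad)
    neigh = neigh.view(B, C, window * window, H * W)
    q = query_feat.view(B, C, 1, H * W)
    # Scaled dot-product similarity of each query position to its neighbourhood.
    attn = (q * neigh).sum(dim=1, keepdim=True) / C ** 0.5   # (B, 1, k*k, H*W)
    attn = attn.softmax(dim=2)
    # Weighted sum of neighbourhood features.
    out = (attn * neigh).sum(dim=2)                           # (B, C, H*W)
    return out.view(B, C, H, W)

The aggregated features could then be fused with the current frame (for example, fed to a recurrent unit such as a ConvGRU) before the detection head, avoiding a separate optical-flow network.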

Video Object Detection
Self-Attention
Spatial Attention
Results
Ablation Study
Conclusions